Important packages:

setwd("Dataset")
Dataset <- read.csv("data.csv")
Preprocessed_dataset <-  read.csv("preprocessed_dataset.csv") 
if (!require(ggplot2, quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
if (!require(dplyr, quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)

if (!require(ltm, quietly = TRUE)) {
  install.packages("ltm")
}
library(ltm)
if (!require(cluster, quietly = TRUE)) {
  install.packages("cluster")
}
if (!require(factoextra, quietly = TRUE)) {
  install.packages("factoextra")
}
 
if (!require(caret, quietly = TRUE)) {
  install.packages("caret")
}
if (!require(party, quietly = TRUE)) {
  install.packages("party")
}
if (!require(partykit, quietly = TRUE)) {
  install.packages("partykit")
}

if (!require(rpart, quietly = TRUE)) {
  install.packages("rpart")
}
if (!require(rpart.plot, quietly = TRUE)) {
  install.packages("rpart.plot")
}
if (!require(tidyverse, quietly = TRUE)) {
  install.packages("tidyverse")
}

if (!require(RWeka, quietly = TRUE)) {
  install.packages("RWeka")
}

IT326 Project

_________________________________________________________________________________________________________

The Goal and problem

Our primary objective in this analysis is to classify whether a student will go to college or not. We aim to discover the most influential variables and their relationships with the decision not to attend college, so that counseling sessions can be offered to help those students.

Data mining Task

Using classification methods, we aim to identify the main factors and reasons why students are less likely to pursue higher education, taking the class label “will_go_to_college” with the value ‘False’ as the target. We leverage the provided dataset, with attributes such as school type, school accreditation, gender, interest in college, and residence, and additionally use clustering to group patterns among students who will or will not go to college.

The Source

Kaggle.com

General information

  • Number of attributes : 11
  • Number of rows (objects) : 1000
  • The class label: The class label in our project is the attribute “will_go_to_college”. This attribute is binary, meaning that it can take on two values: True for yes or False for no. The value of this attribute will be the target variable that we are trying to predict during our project.

Attribute | Description | Type | Possible values
type_school | The type of school the student attends | Binary | Academic / Vocational
school_accreditation | The quality of the school | Binary | A / B (A is better than B)
gender | The student’s gender | Binary | Male / Female
interest | The student’s interest in going to college | Nominal | Very Interested / Interested / Less Interested / Not Interested / Uncertain
residence | The student’s residence | Binary | Urban / Rural
parent_age | The age of the student’s parents | Numeric | 40 - 65
parent_salary | The monthly salary of the student’s parents in IDR/Rupiah [1 Rupiah = 0.00024 SAR] | Numeric | 1,000K - 10M
house_area | The size of the student’s house in square meters | Numeric | 20 - 120
average_grades | The student’s average grades in school | Numeric | 75 - 98
parent_was_in_college | Whether the student’s parents attended college | Binary | True / False
will_go_to_college | Whether the student plans to go to college | Binary | True / False

Sample of our data

head(Dataset)

Here is a sample of 6 rows from our dataset.

Missing values

sum(is.na(Dataset))
[1] 0

There are no missing values in our dataset.

Statistical graphs

Graph 1:

df =data.frame(Dataset)
ggplot(data=df, aes(x = interest, fill = will_go_to_college)) +
  geom_bar() +
  scale_x_discrete(limits = c('Not Interested', 'Less Interested', 'Uncertain', 'Interested', 'Very Interested')) +
  labs(title = 'College interest vs College attendance ') +
  scale_fill_manual(values = c("True" = "antiquewhite2", "False" = "antiquewhite3")) +
  theme_minimal()


According to the graph, whether students are interested in going to college does not fully determine whether they actually end up attending college. There is a group of individuals who were interested in attending but did not get in, while others who were not interested were accepted.

Graph 2:


filtered_True =filter(Dataset, will_go_to_college == 'True')
filtered_False =subset(Dataset, will_go_to_college =='False')

ggplot() +
  geom_density(data = filtered_True, aes(x = average_grades, fill = "Going to College"), alpha = 0.5) +
  geom_density(data = filtered_False, aes(x = average_grades, fill = "NOT Going to College"), alpha = 0.5) +
  labs(x = "Average Grades", y = "Density") +
  ggtitle("Comparison of Average Grades for Students Going to College and NOT Going to College") +
  scale_fill_manual(values = c("Going to College" = "antiquewhite4", "NOT Going to College" = "antiquewhite1"))

The graph shows that the average grades of students who were accepted to college were higher than those of students who did not enter college, which indicates a correlation between going to college and average grades.

Graph 3:

Dataset_percentage <- Dataset %>%
  group_by(type_school) %>%
  summarise(percentage = mean(will_go_to_college == "True") * 100)

# Create a percentage chart
ggplot(Dataset_percentage, aes(x = type_school, y = percentage, fill = type_school)) +
  geom_bar(stat = "identity") +
  labs(title = "Percentage of Students Going to College by Type of School",
       x = "Type of School",
       y = "Percentage") +
  scale_fill_manual(values = c("Academic" = "antiquewhite3", "Vocational" = "antiquewhite2")) +
  theme_minimal()


This graph shows the impact of the type of high school attended by students on their college attendance. Based on the bar chart:

  • Among students from Academic high schools, 313 are going to college and 296 are not.

  • Among students from Vocational high schools, 187 are going to college and 204 are not.

This tells us that a higher proportion of students from academic high schools are going to college compared to those from vocational schools, which suggests that the type of school attended may influence college attendance.

Graph 4:

ggplot(Dataset, aes(x = average_grades)) +
  geom_histogram(binwidth = 5, fill = "antiquewhite2", color = "antiquewhite4") +
  labs(title = "Distribution of students' grades",
       x = "Students' average grades",
       y = "Frequency") +
  theme_minimal()

This histogram shows us that the majority of students in the dataset are performing well, since their grades span between 75 and 98. This analysis will help us determine whether students’ academic performance level is a contributing factor to their college attendance.

Statistical Measures

  1. The student’s academic performance analysis:
summary(Dataset$average_grades)

The student grades in our dataset range from 75.00 to 98.00, with a median of 85.58 and a mean of 86.10. This suggests that most students are doing well, as none of them have average grades below 50. However, it is interesting to note that some students score considerably above or below the average, reflecting the wide range of grades.

  2. The socioeconomic status of students’ families analysis:
summary(Dataset$parent_salary)

In the dataset, students’ parents have salaries ranging from 1,000,000 to 10,000,000 IDR/Rupiah. The median salary is 5,440,000 IDR/Rupiah and the average is 5,381,570 IDR/Rupiah. This tells us that many parents in our dataset earn less than the average salary in Indonesia, which is 146,000,000 IDR. This suggests that quite a few students in our dataset come from families with limited finances, and this financial situation could certainly impact their ability to get through college.

Additionally, we can utilize the house area attribute to gain a deeper understanding of the socioeconomic status of students’ families: students with houses significantly larger than the mean might indicate a higher socioeconomic status, while houses considerably smaller than the mean might reflect a comparatively lower one. Based on the shown output, the house areas range from 20.00 to 120.00 ㎡. The median house area of 75.50 ㎡ indicates that families around this value likely have moderate socioeconomic status, with houses that are neither very small nor very large.

  3. Understanding Parent Age Range and Variation in its Values
summary(Dataset$parent_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  40.00   50.00   52.00   52.21   54.00   65.00 
SD=sd(Dataset$parent_age)
MeanAge=mean(Dataset$parent_age)
cat("coefficient of variation:",SD/MeanAge*100,"%")
coefficient of variation: 6.704771 %

This summary gives the range of the age attribute, [40, 65], which indicates that all parents are middle-aged; at this age, parents tend to have more concern for their children. The coefficient of variation is 6.7%, which indicates low variation: the values of parent_age are relatively close to the mean. Overall, 25% of parents are aged 50 or younger, 75% are aged 54 or younger, and the median age is 52.

Preprocessing data

We aim to increase the quality of the data and transform it to make it suitable for analysis.

Outliers analysis

###parent age outliers
quartiles <- quantile(Dataset$parent_age, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_age)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(Dataset, Dataset$parent_age > Lower & Dataset$parent_age < Upper)
dim(data_no_outlier)
[1] 957  11
###parent salary outliers
quartiles <- quantile(Dataset$parent_salary, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_salary)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$parent_salary> Lower & data_no_outlier$parent_salary < Upper)
dim(data_no_outlier)
[1] 955  11
###average grades outliers
quartiles <- quantile(Dataset$average_grades, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$average_grades)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$average_grades> Lower & data_no_outlier$average_grades < Upper)
dim(data_no_outlier)
[1] 944  11
###house area outliers
quartiles <- quantile(Dataset$house_area, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$house_area)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$house_area> Lower & data_no_outlier$house_area < Upper)

Founded_Outliers=data.frame(anti_join(Dataset,data_no_outlier))
Joining with `by = join_by(type_school, school_accreditation, gender, interest, residence, parent_age, parent_salary, house_area, average_grades, parent_was_in_college, will_go_to_college)`
print(Founded_Outliers)

After conducting data analysis and identifying outliers, our inspection reveals that the detected outliers represent inherent variation within the population. For parent_age, outliers are flagged below 44 and above 65; however, ages from 40 to 65 fall within the expected range of our dataset, so these values are not true anomalies. For parent_salary, we found two outliers: one below 1,326,250 IDR ≈ 85 USD and another above 9,416,250 IDR ≈ 606 USD, while the minimum and maximum values are 1,000,000 IDR ≈ 64 USD and 10,000,000 IDR ≈ 644 USD, respectively. For average grades, twelve outliers were identified, below 76 or above 97; since these still fall within the acceptable range of 0 to 100, they should be retained as normal grade values. Finally, for house_area, we found eleven outliers below 34.4 ㎡ or above 115 ㎡, with a minimum of 20 ㎡ and a maximum of 120 ㎡; these values are still typical for the population.

Normalization

normalize <- function(x) {return((x-min(x))/ (max(x)-min(x)))}
datasetWithoutNormalization<-Dataset
Dataset$parent_salary<-normalize(datasetWithoutNormalization$parent_salary)
Dataset$house_area<-normalize(datasetWithoutNormalization$house_area)
print(Dataset)

We applied normalization to the ‘parent_salary’ and ‘house_area’ attributes, scaling their values to a range between 0 and 1. This normalization process greatly facilitates data handling and analysis, ensuring that these attributes are on a consistent scale. Which will improve the reliability of our data analysis and enable better conclusions to be drawn from the dataset. Normalization is a crucial step in preparing the data for modeling, as it prevents attributes with larger numerical ranges from dominating the analysis and ensures fair treatment for all features.
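As a toy illustration (not project data), the min-max formula above maps any numeric vector onto [0, 1]:

```r
# Toy illustration of the min-max normalization applied above.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

normalize(c(1e6, 5.5e6, 1e7))  # e.g. salaries of 1M, 5.5M, and 10M IDR
# -> 0.0 0.5 1.0
```

The minimum maps to 0, the maximum to 1, and every other value falls proportionally in between, so attributes measured on very different scales become directly comparable.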

Discretization



grades <- Dataset$average_grades  # keep a numeric copy: assigning labels coerces the column to character
Dataset$average_grades[grades >= 95] <- 'A+'
Dataset$average_grades[grades < 95 & grades >= 90] <- 'A'
Dataset$average_grades[grades < 90 & grades >= 85] <- 'B+'
Dataset$average_grades[grades < 85 & grades >= 80] <- 'B'
Dataset$average_grades[grades < 80 & grades >= 75] <- 'C+'
Dataset$average_grades[grades < 75 & grades >= 70] <- 'C'
Dataset$average_grades[grades < 70 & grades >= 65] <- 'D+'
Dataset$average_grades[grades < 65 & grades >= 60] <- 'D'
Dataset$average_grades[grades < 60 & grades >= 0] <- 'F'
Dataset$average_grades <- as.character(Dataset$average_grades)
print(Dataset)

We transformed the parent_age attribute into intervals by dividing the values into two equal-width bins, (40, 50] and (50, 60]. Through discretization, the values will be simpler to classify and to use in other methods that can help us later in our model.
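The notebook does not show the binning code itself; a minimal sketch of how parent_age could be discretized with cut(), assuming the equal-width intervals described above, is:

```r
# Hypothetical sketch: discretize ages into the equal-width bins (40,50] and (50,60].
ages <- c(41, 48, 52, 57, 60)  # toy values within the dataset's 40-65 range
bins <- cut(ages, breaks = c(40, 50, 60), right = TRUE)
as.character(bins)
# -> "(40,50]" "(40,50]" "(50,60]" "(50,60]" "(50,60]"
```

Ages above 60 (the dataset maximum is 65) would need an extra break, e.g. breaks = c(40, 50, 60, 65).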

To better utilize and interpret the grades attribute for each student, we converted the numeric grades into letter grades (A+, A, B+, B, C+, C, D+, D, F). This transformation was undertaken to focus on the general letter-grade representation rather than the precise numerical values.

Encoding

Dataset$parent_was_in_college[Dataset$parent_was_in_college=="TRUE"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="True"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="FALSE"]<-0
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="False"]<-0

Dataset$will_go_to_college[Dataset$will_go_to_college=="TRUE"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="True"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="FALSE"]<-1
Dataset$will_go_to_college[Dataset$will_go_to_college=="False"]<-1

Dataset$gender[Dataset$gender=="Female"]<-1
Dataset$gender[Dataset$gender=="Male"]<-0

Dataset$school_accreditation[Dataset$school_accreditation=="A"]<-1
Dataset$school_accreditation[Dataset$school_accreditation=="B"]<-0

Dataset$interest[Dataset$interest=="Very Interested"]<-4
Dataset$interest[Dataset$interest=="Interested"]<-3
Dataset$interest[Dataset$interest=="Less Interested"]<-2
Dataset$interest[Dataset$interest=="Not Interested"]<-1
Dataset$interest[Dataset$interest=="Uncertain"]<-0


Dataset$type_school[Dataset$type_school=="Academic"]<-1 
Dataset$type_school[Dataset$type_school=="Vocational"]<-0

Dataset$residence[Dataset$residence=="Urban"]<-1
Dataset$residence[Dataset$residence=="Rural"]<-0
print(Dataset)

Since encoding is an important step in data preprocessing that enables the use of categorical data in various data analysis and machine learning tasks, we encoded attributes such as ‘parent_was_in_college’ from (True, False) to (1, 0) and ‘will_go_to_college’ from (True, False) to (0, 1). This encoding is carried out as we aim to predict the influencing factors. Additionally, we encoded ‘gender’ from (Female, Male) to (1, 0), ‘school_accreditation’ from (A, B) to (1, 0), ‘type_school’ from (Academic, Vocational) to (1, 0), ‘residence’ from (Urban, Rural) to (1, 0), and ‘interest’ from (Very Interested, Interested, Less Interested, Not Interested, Uncertain) to (4, 3, 2, 1, 0), respectively. Encoding serves to simplify the data, reduce complexity, and enhance its suitability for modeling purposes.

Correlation analysis

Chi-square test for nominal attributes:


#1
C=chisq.test(Dataset$type_school , Dataset$will_go_to_college)
print(C)

    Pearson's Chi-squared test with Yates' continuity correction

data:  Dataset$type_school and Dataset$will_go_to_college
X-squared = 1.0751, df = 1, p-value = 0.2998
#2

C=chisq.test(Dataset$school_accreditation , Dataset$will_go_to_college)
print(C)

    Pearson's Chi-squared test with Yates' continuity correction

data:  Dataset$school_accreditation and Dataset$will_go_to_college
X-squared = 0.78513, df = 1, p-value = 0.3756
#3
C=chisq.test(Dataset$gender , Dataset$will_go_to_college)
print(C)

    Pearson's Chi-squared test with Yates' continuity correction

data:  Dataset$gender and Dataset$will_go_to_college
X-squared = 1.0249, df = 1, p-value = 0.3114
#4
C=chisq.test(Dataset$interest , Dataset$will_go_to_college)
print(C)

    Pearson's Chi-squared test

data:  Dataset$interest and Dataset$will_go_to_college
X-squared = 73.337, df = 4, p-value = 4.477e-15
#5
C=chisq.test(Dataset$residence , Dataset$will_go_to_college)
print(C)

    Pearson's Chi-squared test with Yates' continuity correction

data:  Dataset$residence and Dataset$will_go_to_college
X-squared = 0.016098, df = 1, p-value = 0.899
#6
C=chisq.test(Dataset$average_grades , Dataset$will_go_to_college)
print(C)

    Pearson's Chi-squared test

data:  Dataset$average_grades and Dataset$will_go_to_college
X-squared = 261.89, df = 4, p-value < 2.2e-16
#7
C=chisq.test(Dataset$parent_was_in_college , Dataset$will_go_to_college)
print(C)

    Pearson's Chi-squared test with Yates' continuity correction

data:  Dataset$parent_was_in_college and Dataset$will_go_to_college
X-squared = 2.1194, df = 1, p-value = 0.1454

At the 0.05 significance level, only ‘interest’ and ‘average_grades’ have p-values small enough to reject the null hypothesis of independence (p = 4.477e-15 and p < 2.2e-16, respectively), indicating a significant association between these two attributes and the student’s decision to go to college or not.

For the remaining attributes (type_school, school_accreditation, gender, residence, parent_was_in_college), the p-values exceed 0.05, so we fail to reject the null hypothesis; the evidence of association with the class label is weak.

Correlation coefficient analysis for numeric attribute:


biserial.cor(Dataset$parent_salary,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
[1] 0.4756928
biserial.cor(Dataset$house_area,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
[1] 0.4672669
biserial.cor(Dataset$parent_age,Dataset$will_go_to_college,c("all.obs", "complete.obs"), level = 1)
[1] 0.04287336
 
 

The analysis shows moderate correlation coefficients for parent salary and house area with the class label, which indicates that they are relevant factors: the higher the parents’ salary and the larger the house area, the higher the probability that a student enrolls in college.

On the other hand, the correlation coefficient for parent age is very small, which indicates that parent age has little impact on the probability of a student enrolling in college.

Feature selection:

Ultimately, based on the correlation analysis we conducted on the relationship between the dataset attributes and the class label, together with our understanding of the data, the context of each attribute, and its potential relevance to the class label, we decided not to delete any of the attributes.

Classification:

The attribute selection measure is a heuristic technique employed to select the best criterion for dividing a collection of data tuples into partitions that are as pure as possible. Several attribute selection measures are commonly utilized, such as Information Gain (ID3), Gain Ratio (C4.5), and the Gini Index (CART). We will evaluate each of these measures on different partitions of the data, splitting it into ratios of 50%-50%, 70%-30%, and 80%-20%.
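As a quick self-contained illustration of the impurity measures these criteria optimize (not part of the project code):

```r
# Impurity measures behind the three attribute selection criteria.
entropy <- function(p) {          # used by Information Gain (ID3) and Gain Ratio (C4.5)
  p <- p[p > 0]                   # drop zero probabilities: 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}
gini <- function(p) 1 - sum(p^2)  # used by the Gini Index (CART)

p <- c(0.5, 0.5)  # our class label is balanced 50/50
entropy(p)        # 1 bit: maximal entropy for a binary class
gini(p)           # 0.5: maximal Gini impurity for two classes
```

A split is good when it moves the child partitions away from these maximal-impurity values toward 0.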

factor the data



data<-Preprocessed_dataset
data$will_go_to_college <- as.factor(data$will_go_to_college)
data$residence <- as.factor(data$residence)
data$gender <- as.factor(data$gender)
data$parent_was_in_college <- as.factor(data$parent_was_in_college)
data$interest <- as.factor(data$interest)
data$type_school <- as.factor(data$type_school)
data$school_accreditation <- as.factor(data$school_accreditation)
data$average_grades <- as.factor(data$average_grades)

balanced or imbalanced


library(tidyverse) 
library(caret)
class_numeric <- as.numeric(data$will_go_to_college)  # temporary copy, so the factor class label is not overwritten
hist(class_numeric, col = "coral")

prop.table(table(class_numeric))

  1   2 
0.5 0.5 

We want to confirm that the distribution between the two class labels is not too different, because imbalanced datasets can lead to misleading accuracy.

Fortunately, our data is balanced.

70% training, 30% testing
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.70 , 0.30))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
[1] 705  11
dim(test_data)
[1] 295  11

We partition the data into 70% training and 30% testing. This results in 705 objects in the training set and 295 objects in the testing set.

information gain


library(party)
myFormula <- will_go_to_college ~ type_school + school_accreditation + gender + interest + residence + parent_age + parent_salary + house_area + average_grades + parent_was_in_college

dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college)
   
      0   1
  0 309  48
  1  37 311
plot(dataset_ctree,type="simple")

testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )
library(caret)
co_result <- confusionMatrix(result,positive="1")
print(co_result)
Confusion Matrix and Statistics

        
testPred   0   1
       0 136  34
       1  18 107
                                          
               Accuracy : 0.8237          
                 95% CI : (0.7754, 0.8655)
    No Information Rate : 0.522           
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.6451          
                                          
 Mcnemar's Test P-Value : 0.03751         
                                          
            Sensitivity : 0.7589          
            Specificity : 0.8831          
         Pos Pred Value : 0.8560          
         Neg Pred Value : 0.8000          
             Prevalence : 0.4780          
         Detection Rate : 0.3627          
   Detection Prevalence : 0.4237          
      Balanced Accuracy : 0.8210          
                                          
       'Positive' Class : 1               
                                          

The table below shows the evaluation of the model on the testing data:

Evaluation metric   Value
Accuracy            82.37%
Error rate          17.63%
Sensitivity         75.89%
Specificity         88.31%
Precision           85.60%
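As a sanity check (a sketch, not part of the notebook output), these metrics follow directly from the confusion-matrix counts above, with class "1" positive (TP = 107, TN = 136, FP = 18, FN = 34):

```r
# Recompute the evaluation metrics from the confusion-matrix counts above.
TP <- 107; TN <- 136; FP <- 18; FN <- 34

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
error_rate  <- 1 - accuracy
sensitivity <- TP / (TP + FN)  # recall of the positive class
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)  # positive predictive value

round(c(accuracy, error_rate, sensitivity, specificity, precision), 4)
# -> 0.8237 0.1763 0.7589 0.8831 0.8560
```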

In this decision tree model, the average grade with the highest information gain is selected as the root attribute. This means that the average grade is considered the most informative feature for classifying the data.

To enhance the model’s predictive capability, additional splitting criteria are incorporated based on their information gain values. These criteria include type_school, interest, residence, parent_salary, and house_area. By considering these attributes, the model aims to capture more relevant information from the data and improve its classification accuracy.

The model then utilizes house area and average grades as splitting criteria to separate the data into different branches of the tree at multiple levels.

However, it is worth noting that this decision tree is relatively shallow, which implies it may not capture intricate patterns or relationships present in the data. For example, gender, parent_was_in_college, parent_age, and school_accreditation are not included in the tree, due to limited support in the data or their lower information gain values.

Gain ratio

library(RWeka)
C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
   
      0   1
  0 322  24
  1  27 332
plot(C45Fit,type = "simple")

testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 128  32
         1  26 109
                                          
               Accuracy : 0.8034          
                 95% CI : (0.7534, 0.8472)
    No Information Rate : 0.522           
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6053          
                                          
 Mcnemar's Test P-Value : 0.5115          
                                          
            Sensitivity : 0.7730          
            Specificity : 0.8312          
         Pos Pred Value : 0.8074          
         Neg Pred Value : 0.8000          
             Prevalence : 0.4780          
         Detection Rate : 0.3695          
   Detection Prevalence : 0.4576          
      Balanced Accuracy : 0.8021          
                                          
       'Positive' Class : 1               
                                          

The table below shows the evaluation of the model on the testing data:

Evaluation metric   Value
Accuracy            80.34%
Error rate          19.66%
Sensitivity         77.30%
Specificity         83.12%
Precision           80.74%

In contrast to the previous decision tree, the current tree is more complex and includes all the attributes except for gender. This means that the decision tree takes into account a wider range of factors to make its classifications. The attribute with the highest Gain ratio, which measures the effectiveness of a split, is chosen as the root attribute. In this case, Parent salary is determined to have the highest Gain ratio and is selected as the root attribute.

The model then repeatedly uses average grades as the splitting criterion at different levels of the tree. This indicates that the average grades are considered crucial for classifying the data and further subdividing the branches.

Gini index

library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)


rpart.plot(cart_fit)


testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 125  31
         1  29 110
                                          
               Accuracy : 0.7966          
                 95% CI : (0.7461, 0.8411)
    No Information Rate : 0.522           
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.5922          
                                          
 Mcnemar's Test P-Value : 0.8973          
                                          
            Sensitivity : 0.7801          
            Specificity : 0.8117          
         Pos Pred Value : 0.7914          
         Neg Pred Value : 0.8013          
             Prevalence : 0.4780          
         Detection Rate : 0.3729          
   Detection Prevalence : 0.4712          
      Balanced Accuracy : 0.7959          
                                          
       'Positive' Class : 1               
                                          

The table below shows the evaluation of the model on the testing data:

Evaluation metric   Value
Accuracy            79.66%
Error rate          20.34%
Sensitivity         78.01%
Specificity         81.17%
Precision           79.14%

In this particular partition of the decision tree, parent salary was once again chosen as the root attribute, because splitting on it yields the largest reduction in Gini impurity.

Gender and residence, on the other hand, were not included in this partition of the tree. Without sufficient data or a significant information gain associated with these attributes, the model may not have enough evidence to utilize them for accurate classification.

It is also worth noting that the accuracy of this specific model was the lowest of the three so far, at 79.66%.

80% training, 20% testing
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.80 , 0.20))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
[1] 802  11
dim(test_data)
[1] 198  11

We partition the data into 80% training and 20% testing. This results in 802 objects in the training set and 198 objects in the testing set.

information gain

library(party)
myFormula <- will_go_to_college ~ type_school + school_accreditation + gender + interest + residence + parent_age + parent_salary + house_area + average_grades + parent_was_in_college

dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college)
   
      0   1
  0 344  41
  1  56 361
plot(dataset_ctree,type="simple")

testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )
library(caret)
co_result <- confusionMatrix(result,positive="1")
print(co_result)
Confusion Matrix and Statistics

        
testPred  0  1
       0 81 13
       1 19 85
                                          
               Accuracy : 0.8384          
                 95% CI : (0.7796, 0.8868)
    No Information Rate : 0.5051          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6769          
                                          
 Mcnemar's Test P-Value : 0.3768          
                                          
            Sensitivity : 0.8673          
            Specificity : 0.8100          
         Pos Pred Value : 0.8173          
         Neg Pred Value : 0.8617          
             Prevalence : 0.4949          
         Detection Rate : 0.4293          
   Detection Prevalence : 0.5253          
      Balanced Accuracy : 0.8387          
                                          
       'Positive' Class : 1               
                                          

The table below shows the evaluation of the model on the testing data:

Evaluation metric   Value
Accuracy            83.84%
Error rate          16.16%
Sensitivity         86.73%
Specificity         81.00%
Precision           81.73%

In this partition of the decision tree, average grades, the attribute with the highest information gain, is again selected as the root. The selection reflects the fact that the average grades attribute has the lowest conditional entropy, so splitting on it creates the most effective partition and improves classification accuracy. Note also that gender, parent_was_in_college, and parent_age were not included in the tree.

On a positive note, the accuracy in this partition has increased by 1.47% compared to the previous partition. This improvement in accuracy can be attributed to the fact that the training data has been increased by 10%. By having more data for training, the model has a larger and more diverse set of examples to learn from, allowing it to make more accurate predictions.

Gain ratio

library(RWeka)


C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
   
      0   1
  0 378  22
  1  16 386
plot(C45Fit,type = "simple")

testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 76 15
         1 24 83
                                         
               Accuracy : 0.803          
                 95% CI : (0.7407, 0.856)
    No Information Rate : 0.5051         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.6064         
                                         
 Mcnemar's Test P-Value : 0.2002         
                                         
            Sensitivity : 0.8469         
            Specificity : 0.7600         
         Pos Pred Value : 0.7757         
         Neg Pred Value : 0.8352         
             Prevalence : 0.4949         
         Detection Rate : 0.4192         
   Detection Prevalence : 0.5404         
      Balanced Accuracy : 0.8035         
                                         
       'Positive' Class : 1              
                                         

The table below summarizes the model's evaluation on the testing data (accuracy 0.803 from the output above, i.e., 80.30%):

Metric        Value
Accuracy      80.30%
Error rate    19.70%
Sensitivity   84.69%
Specificity   76.00%
Precision     77.57%

In this model, which uses the gain ratio, the decision tree is once again more complex than the other models. Parent salary is chosen as the root attribute because it yields the highest gain ratio. However, it is important to note that the gender attribute was not included in the decision tree.

Gini index

library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)


rpart.plot(cart_fit)


testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 85 23
         1 15 75
                                          
               Accuracy : 0.8081          
                 95% CI : (0.7462, 0.8605)
    No Information Rate : 0.5051          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6158          
                                          
 Mcnemar's Test P-Value : 0.2561          
                                          
            Sensitivity : 0.7653          
            Specificity : 0.8500          
         Pos Pred Value : 0.8333          
         Neg Pred Value : 0.7870          
             Prevalence : 0.4949          
         Detection Rate : 0.3788          
   Detection Prevalence : 0.4545          
      Balanced Accuracy : 0.8077          
                                          
       'Positive' Class : 1               
                                          

The table below summarizes the model's evaluation on the testing data:

Metric        Value
Accuracy      80.81%
Error rate    19.19%
Sensitivity   76.53%
Specificity   85.00%
Precision     83.33%

For the first time, house area was chosen as the root attribute due to its high purity, as indicated by its low Gini index.

The model relies on the parent_salary attribute to split the tree multiple times, indicating its importance in determining the classification. This suggests that the parent’s salary has a significant impact on whether a student will go to college or not, and it is repeatedly used as a criterion for further branching in the tree.

It’s worth noting that 20% of the dataset follows the rule:

IF house_area < 0.54 and parent_salary < 0.42, THEN will go to college = 1.

This rule implies that when the house area is below a certain threshold and the parent’s salary is below another threshold, there is a higher likelihood of the student going to college.

Moreover, 28% of the dataset follows the rule:

IF house_area >= 0.54 and parent_salary >= 0.37 and type_school = 1, THEN will go to college = 1.

This rule suggests that when the house area is above a certain threshold, the parent's salary is above another threshold, and the type of school is 1, there is a higher probability of the student going to college.

Overall, these two rules cover a significant portion of the dataset, 48% of it. This indicates that a large amount of the data can be classified by these specific rules, providing valuable insight into the factors that influence a student's likelihood of attending college. We only mention these two rules due to their high coverage of the data.
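The rules above were read off the plotted tree; they can also be listed programmatically with `rpart.plot::rpart.rules()`, which prints one rule per leaf along with the share of data it covers. The sketch below runs on a synthetic stand-in (the `toy` frame and its columns are invented here purely for illustration, since the block must be self-contained):

```r
library(rpart)
library(rpart.plot)

# Synthetic stand-in for the real training data (columns invented here)
set.seed(1)
toy <- data.frame(house_area    = runif(200),
                  parent_salary = runif(200))
toy$will_go_to_college <- factor(as.integer(toy$house_area + toy$parent_salary > 1))

fit <- rpart(will_go_to_college ~ ., data = toy, method = "class", cp = 0.008)

# One row per leaf: predicted class, probability, cover %, and the conditions
rpart.rules(fit, cover = TRUE)
```

Running the same call on `cart_fit` would print the 20% and 28% rules discussed above directly from the fitted tree.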

50% training, 50% testing
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.50 , 0.50))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
[1] 493  11
dim(test_data)
[1] 507  11

We partition the data into 50% training and 50% testing. This results in 493 objects in the training set and 507 objects in the testing set.
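The `sample(2, ..., prob = c(0.50, 0.50))` call above assigns each row independently, which is why a 50% split of 1000 rows yields 493/507 rather than exactly 500/500. A minimal base-R sketch of an exact split (the `exact_split` helper is hypothetical, introduced only for illustration):

```r
# Hypothetical helper: draw an exact training fraction instead of a
# probabilistic per-row assignment, so 50% of 1000 rows is always 500/500.
exact_split <- function(df, train_frac = 0.5, seed = 123) {
  set.seed(seed)
  n_train <- floor(nrow(df) * train_frac)
  idx <- sample(seq_len(nrow(df)), n_train)
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}

parts <- exact_split(data.frame(x = 1:1000))
nrow(parts$train)  # 500
nrow(parts$test)   # 500
```

caret (loaded at the top) offers `createDataPartition()` for the same purpose, with the added benefit of stratifying on the class label.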

information gain

library(party)
myFormula <- will_go_to_college ~ type_school + school_accreditation + gender + interest + residence + parent_age + parent_salary + house_area + average_grades + parent_was_in_college

dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college) 
   
      0   1
  0 211  23
  1  40 219
plot(dataset_ctree,type="simple")

testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )

library(caret)
co_result <- confusionMatrix(result, positive = "1")
print(co_result)
Confusion Matrix and Statistics

        
testPred   0   1
       0 196  27
       1  53 231
                                          
               Accuracy : 0.8422          
                 95% CI : (0.8075, 0.8729)
    No Information Rate : 0.5089          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6837          
                                          
 Mcnemar's Test P-Value : 0.005189        
                                          
            Sensitivity : 0.8953          
            Specificity : 0.7871          
         Pos Pred Value : 0.8134          
         Neg Pred Value : 0.8789          
             Prevalence : 0.5089          
         Detection Rate : 0.4556          
   Detection Prevalence : 0.5602          
      Balanced Accuracy : 0.8412          
                                          
       'Positive' Class : 1               
                                          

The table below summarizes the model's evaluation on the testing data:

Metric        Value
Accuracy      84.22%
Error rate    15.78%
Sensitivity   89.53%
Specificity   78.71%
Precision     81.34%

In this decision tree model, the root attribute is determined by selecting the average grade with the highest information gain. This indicates that the average grade is considered the most informative feature for classifying the data, as it provides significant insights into the classification process.

To improve the model’s predictive capability, additional splitting criteria are incorporated based on their information gain values. These criteria include type_school, interest, residence, parent_salary, house_area, and whether the parent was in college. By considering these attributes, the model aims to capture more relevant information and enhance its ability to classify the data accurately.

In the tree structure, the model utilizes both the house area and average grades as splitting criteria in multiple levels. This means that the data is divided into different branches of the tree based on these attributes, allowing for more refined classification.

Gender, parent age, and school accreditation are not included in the tree due to their lower information gain values, indicating that they may have less impact on the classification process according to the model's evaluation.

Gain ratio

library(RWeka)
C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
   
      0   1
  0 234  17
  1  13 229
plot(C45Fit,type = "simple")

testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 206  32
         1  43 226
                                          
               Accuracy : 0.8521          
                 95% CI : (0.8181, 0.8818)
    No Information Rate : 0.5089          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7038          
                                          
 Mcnemar's Test P-Value : 0.2482          
                                          
            Sensitivity : 0.8760          
            Specificity : 0.8273          
         Pos Pred Value : 0.8401          
         Neg Pred Value : 0.8655          
             Prevalence : 0.5089          
         Detection Rate : 0.4458          
   Detection Prevalence : 0.5306          
      Balanced Accuracy : 0.8516          
                                          
       'Positive' Class : 1               
                                          

The table below summarizes the model's evaluation on the testing data:

Metric        Value
Accuracy      85.21%
Error rate    14.79%
Sensitivity   87.60%
Specificity   82.73%
Precision     84.01%

In contrast to the previous decision tree, the current tree is more complex and includes all the attributes except for parent age. Notably, for the first time, gender is included as an attribute in the tree. This indicates that the decision tree now takes into account a wider range of factors, including gender, to make its classifications. By incorporating these additional attributes, the model aims to capture more nuanced patterns and relationships in the data.

The attribute with the highest Gain ratio is chosen as the root attribute. In this particular case, Parent salary is determined to have the highest Gain ratio and is selected as the root attribute. This suggests that Parent salary provides the most significant reduction in uncertainty and is considered crucial for classifying the data accurately.

Gini index

library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)

rpart.plot(cart_fit)


testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 204  30
         1  45 228
                                          
               Accuracy : 0.8521          
                 95% CI : (0.8181, 0.8818)
    No Information Rate : 0.5089          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7037          
                                          
 Mcnemar's Test P-Value : 0.106           
                                          
            Sensitivity : 0.8837          
            Specificity : 0.8193          
         Pos Pred Value : 0.8352          
         Neg Pred Value : 0.8718          
             Prevalence : 0.5089          
         Detection Rate : 0.4497          
   Detection Prevalence : 0.5385          
      Balanced Accuracy : 0.8515          
                                          
       'Positive' Class : 1               
                                          

The table below summarizes the model's evaluation on the testing data:

Metric        Value
Accuracy      85.21%
Error rate    14.79%
Sensitivity   88.37%
Specificity   81.93%
Precision     83.52%

The attribute chosen as the root of this decision tree is average grades, primarily due to its high purity, i.e., a low Gini index. The Gini index measures the impurity of a node in a decision tree, so a low value indicates that the attribute yields effective splits for classification.

27% of the dataset follows the rule:

IF average_grades=+A,+B,A and parent_salary>=0.37 and house_area>=0.41 and type_school=1, THEN will_go_to_college=0.

Furthermore, 17% of the dataset follows the rule:

IF average_grades=B,+C and house_area<=0.41, THEN will_go_to_college=1.

These rules highlight the complexity and variability in determining whether a student will go to college. The first rule shows that even a student with stable financial circumstances and good grades is not guaranteed to enroll in college; in contrast, the second rule suggests that other factors could play a role in the decision-making process.

We only mention these two rules due to their high coverage of the data.

final analysis

information gain

Information Gain is the attribute-selection measure used in the ID3 algorithm for decision tree learning. It quantifies the amount of information obtained about the class variable when a given attribute is used to split the data. It is calculated by subtracting the conditional entropy Info_A(D) (the information still needed to classify a tuple after using A to split D into v partitions) from the entropy Info(D) (the expected information needed to classify a tuple in D). A higher Information Gain indicates that the attribute is more informative and should be chosen as the splitting criterion.
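As a minimal base-R illustration of these quantities (an instructional sketch only; `ctree` used above is based on conditional inference tests rather than raw entropy):

```r
# Shannon entropy of a class vector, in bits
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Information gain of splitting 'target' on candidate attribute 'attr':
# Gain(A) = Info(D) - Info_A(D)
info_gain <- function(attr, target) {
  cond <- sum(sapply(split(target, attr), function(part) {
    (length(part) / length(target)) * entropy(part)   # weighted partition entropy
  }))
  entropy(target) - cond
}

# Toy example: a perfectly informative attribute has gain equal to Info(D)
y <- c(1, 1, 0, 0)
a <- c("x", "x", "y", "y")
info_gain(a, y)  # 1, since entropy(y) = 1 and the conditional entropy is 0
```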

for our trees,

Average grades was chosen as the root attribute for all three trees because it has the highest information gain. This attribute reflects the least randomness, or "impurity", in the resulting partitions and helps ensure that a simple tree is found.

Metric        50%-50%   70%-30%   80%-20%
Accuracy      84.22%    82.37%    83.84%
Error rate    15.78%    17.63%    16.16%
Sensitivity   89.53%    75.89%    86.73%
Specificity   78.71%    88.31%    81.00%
Precision     81.34%    85.60%    81.73%

The evaluation table shows that the 50%-50% partition is the best model for information gain based on accuracy (tuples that are correctly classified) at 84.22% and sensitivity (true positive recognition rate) at 89.53%, although it has the lowest specificity (true negative recognition rate) at 78.71%. Overall it is the best model for information gain, with the lowest error rate, 15.78%.

gain ratio

Gain Ratio is a normalization of Information Gain used in the C4.5 algorithm, an extension of ID3. Because Information Gain tends to favor attributes with many distinct values, Gain Ratio corrects this bias by dividing the Information Gain by SplitInfo_A(D).
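A matching base-R sketch of the computation (illustrative only; J48 above is the actual C4.5 implementation used in this project). The helpers are redefined so the block stands alone:

```r
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

info_gain <- function(attr, target) {
  cond <- sum(sapply(split(target, attr), function(part)
    (length(part) / length(target)) * entropy(part)))
  entropy(target) - cond
}

# GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo is the entropy
# of the split itself -- it grows with the number of attribute values,
# penalizing many-valued attributes.
gain_ratio <- function(attr, target) {
  split_info <- entropy(attr)
  if (split_info == 0) return(0)   # single-valued attribute: no usable split
  info_gain(attr, target) / split_info
}

y <- c(1, 1, 0, 0)
a <- c("x", "x", "y", "y")
gain_ratio(a, y)  # 1: gain of 1 bit divided by a SplitInfo of 1 bit
```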

for our trees,

Parent salary was chosen as the root attribute for all three trees due to its high gain ratio. The resulting splits are unbalanced, since gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.

Metric        50%-50%   70%-30%   80%-20%
Accuracy      85.21%    80.34%    80.30%
Error rate    14.79%    19.66%    19.70%
Sensitivity   87.60%    77.30%    84.69%
Specificity   82.73%    83.12%    76.00%
Precision     84.01%    80.74%    77.57%

The evaluation table shows that the 50%-50% partition is the best model for gain ratio based on accuracy (tuples that are correctly classified) at 85.21%, sensitivity (true positive recognition rate) at 87.60%, and precision (the percentage of tuples labeled as positive that are actually positive) at 84.01%, although its specificity (true negative recognition rate) of 82.73% is not the highest. Overall it is the best model for gain ratio, with the lowest error rate, 14.79%.

gini index

Gini Index is the attribute-selection criterion used in the CART (Classification and Regression Trees) algorithm. It measures the impurity of a set of tuples, with lower values of Gini_A(D) indicating higher purity. The Gini Index is calculated as one minus the sum of the squared probabilities of each class label in the set.
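The formula fits in a few lines of base R (illustrative sketch; rpart above computes this internally):

```r
# Gini impurity: gini(D) = 1 - sum(p_i^2). Lower values mean purer nodes.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

gini(c(1, 1, 1, 1))  # 0   -> pure node
gini(c(1, 1, 0, 0))  # 0.5 -> maximally impure for two classes
```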

for our trees,

The gini index produced a different root attribute for each tree (parent salary, house area, average grades).

Metric        50%-50%   70%-30%   80%-20%
Accuracy      85.21%    79.66%    80.81%
Error rate    14.79%    20.34%    19.19%
Sensitivity   88.37%    78.01%    76.53%
Specificity   81.93%    81.17%    85.00%
Precision     83.52%    79.14%    83.33%

The evaluation table shows that the 50%-50% partition is the best model for the gini index based on accuracy (tuples that are correctly classified) at 85.21%, sensitivity (true positive recognition rate) at 88.37%, and precision (tuples labeled as positive that are actually positive) at 83.52%. Overall it is the best model for the gini index, with the lowest error rate, 14.79%.

notable observations

It is important to note that gender was not included in 7 of the 9 trees.

Parent salary was the root of 4 of the 9 trees.

Average grades was the root of 4 of the 9 trees.

House area was the root of only 1 of the 9 trees.

In summary, the analysis of the decision trees reveals that financial circumstances, as indicated by parent's salary, and academic performance, as indicated by average grades, are the most influential factors when predicting college enrollment. In this particular dataset, gender does not appear to play an important role in the decision-making process.

best model

The 50% training, 50% testing split was the best partition for all attribute selection measures, with an accuracy of 85.21% for both the gini index and gain ratio.

Both attribute selection measures yield the same accuracy and error rate, so we compare the remaining evaluation metrics:

Metric        Gain ratio   Gini index
Accuracy      85.21%       85.21%
Error rate    14.79%       14.79%
Sensitivity   87.60%       88.37%
Specificity   82.73%       81.93%
Precision     84.01%       83.52%

Comparing the sensitivity, specificity, and precision:

- Sensitivity: Gini Index has a higher sensitivity value (88.37%) compared to Gain Ratio (87.60%). Higher sensitivity indicates a better ability to correctly identify positive instances so gini index is better regarding Sensitivity.

- Specificity: Gini Index has a lower specificity value (81.93%) compared to Gain Ratio (82.73%). Higher specificity indicates a better ability to correctly identify negative instances.

- Precision: Gain Ratio has a higher precision value (84.01%) compared to Gini Index (83.52%). Higher precision indicates a better ability to correctly classify positive instances.

so gain ratio is better regarding Specificity and Precision .

Based on these comparisons, it is difficult to determine definitively which attribute selection measure is best; the choice between Gini Index and Gain Ratio may depend on the specific requirements and priorities of the problem. Since our classification goal is to identify students who will not go to college, we care more about true negatives than true positives, so we choose the model based on specificity and precision: Gain Ratio.

our final tree is:

Clustering Analysis:

In this analysis, we apply K-means clustering to the dataset using different values of K. K-means clustering is an unsupervised learning algorithm that partitions the data into K clusters based on similarity. We will explore three different values of K and evaluate the clustering results using various metrics.

Removing the class label and preparing the dataset for Clustering

 
original_data <- Preprocessed_dataset

# Remove any non-numeric attributes
numeric_data <- original_data[, sapply(original_data, is.numeric)]

# Remove the class label 'will_go_to_college'
numeric_data <- numeric_data[, !(names(numeric_data) == 'will_go_to_college')]

Now the numeric_data dataset contains only numeric attributes without the class label, making it ready for the clustering process.

___________________________________________________________________________________

K=2

In our exploration of k-means clustering, we start by examining the dataset with K=2, an initial attempt to partition the data into two distinct clusters. This initial analysis serves as a foundational step in understanding the inherent structure and patterns of the data.

# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 2 clusters
kmeans.result <- kmeans(numeric_data, 2)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)


The Silhouette coefficient

#average for each cluster 
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data)) 

#k-means clustering with estimating k and initializations 
fviz_silhouette(avg_sil)

The total within-cluster sum of squares

# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 8669.059 
# NOTE: illustrative placeholder ground-truth labels (only 10 values); the
# real class label was removed above, so the BCubed scores below demonstrate
# the method rather than provide a genuine external validation.
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  

cluster_assignments <- kmeans.result$cluster
 

BCubed recall and precision

# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.002333333 
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.1166667 

Cluster Analysis for K=2:

  • Silhouette Width: 0.36

    • Indicates relatively well-separated and distinct clusters.
  • Within-Cluster Sum of Squares: 8669.059

    • Moderate compactness; clusters are not too tight.
  • BCubed Precision: 0.0023

    • Low precision suggests potential misassignments.
  • BCubed Recall: 0.1167

    • Low recall indicates incomplete representation of true clusters.

So, the clusters are moderately compact with a reasonable silhouette width, but low precision and recall suggest potential misclassification issues.

___________________________________________________________________________________

K=5

Moving to K=5, our exploration in k-means clustering advances to a configuration with five clusters. This adjustment allows for a more detailed partitioning of the dataset, potentially revealing finer grained patterns and distinctions among the data points. The analysis at K=5 will offer insights into the nature of these clusters, assessing their characteristics, compactness, and separation, as we strive to optimize the clustering configuration for the specific nuances of our dataset.

# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 5 clusters
kmeans.result <- kmeans(numeric_data, 5)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)


The Silhouette coefficient

#average for each cluster 
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data)) 

#k-means clustering with estimating k and initializations 
fviz_silhouette(avg_sil)

The total within-cluster sum of squares

# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 4303.841 
# illustrative placeholder labels (10 values), not real ground truth
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  

cluster_assignments <- kmeans.result$cluster
 

BCubed recall and precision

# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.001666667 
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.1333333 

Cluster Analysis for K=5:

  • Silhouette Width: 0.29

    • Indicates less well-separated clusters compared to k=2
  • Within-Cluster Sum of Squares: 4303.841

    • Lower within-cluster sum suggests more compact clusters than k=2
  • BCubed Precision: 0.00166

    • Low precision implies potential misclassifications.
  • BCubed Recall: 0.1333

    • Slightly higher recall compared to k=2, but still low.

Final Analysis: Clusters are more compact than k=2 but less well-separated. Precision and recall remain low, indicating potential misclassifications.

___________________________________________________________________________________

K=7

Proceeding to k=7, our investigation in k-means clustering expands further as we explore a configuration with seven clusters. This adjustment aims to capture even more nuanced patterns and variations within the dataset. Analyzing the characteristics of the seven clusters will provide a more granular understanding of the underlying structure, potentially revealing subtleties that might not be as evident with fewer clusters. As we delve into k=7, our objective is to refine our clustering configuration to align more closely with the intricate details present in the dataset.

# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 7 clusters
kmeans.result <- kmeans(numeric_data, 7)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)


The Silhouette coefficient

#average for each cluster 
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data)) 

#k-means clustering with estimating k and initializations 
fviz_silhouette(avg_sil)

The total within-cluster sum of squares

# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 3389.656 
# illustrative placeholder labels (10 values), not real ground truth
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  

cluster_assignments <- kmeans.result$cluster
 

BCubed recall and precision

# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.0006666667 
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.06666667 

Cluster Analysis for k=7:

  • Silhouette Width: 0.27

    • Indicates clusters are less well-separated compared to k=2 and k=5
  • Within-Cluster Sum of Squares: 3389.656

    • Lower within-cluster sum suggests more compact clusters than k=5
  • BCubed Precision: 0.00067

    • Very low precision indicating significant misclassifications.
  • BCubed Recall: 0.067

    • Low recall suggesting incomplete representation of true clusters.

Final Analysis: Clusters are less well-separated than for k=2 and k=5; while more compact than k=5, precision and recall remain low, indicating potential misclassifications.

___________________________________________________________________________________

Elbow Method Insight:

We implemented the Elbow Method for determining the optimal number of clusters (k) in a k-means clustering algorithm.

# Function to calculate total within-cluster sum of squares (wss)
wss <- function(k) {
  kmeans_result <- kmeans(numeric_data, centers = k, nstart = 10)  # You can adjust nstart based on your preference
  return(sum(kmeans_result$tot.withinss))
}

# Calculate the total within-cluster sum of squares for different values of k
k_values <- 1:10  # You can adjust the range of k values
wss_values <- sapply(k_values, wss)

# Plot the elbow curve
plot(k_values, wss_values, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of Clusters (k)", ylab = "Total Within-Cluster Sum of Squares (WSS)",
     main = "Elbow Method")


The elbow method has indicated that K=5 is a reasonable choice in terms of balancing the trade-off between capturing variance and not overly complicating the model with too many clusters.
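As a cross-check on the elbow curve, the average silhouette width can be computed for a range of k with `cluster::silhouette` (the cluster package is loaded at the top). The sketch below runs on synthetic data so it is self-contained; applying the same loop to `numeric_data` should give widths comparable to those reported above (0.36, 0.29, 0.27 for k = 2, 5, 7).

```r
library(cluster)

# Average silhouette width for one value of k (seed as in the chunks above)
avg_sil_width <- function(df, k, seed = 8953) {
  set.seed(seed)
  km <- kmeans(df, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(df))[, "sil_width"])
}

# Synthetic data with two clearly separated groups, so k = 2 should win
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
widths <- sapply(2:4, function(k) avg_sil_width(toy, k))
widths  # the largest average width occurs at k = 2
```

The k that maximizes the average silhouette width is another common selection criterion, complementary to the elbow heuristic.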

Findings

Classification:

Metric         50%-50%                    70%-30%                    80%-20%
               ig       igr      gini     ig       igr      gini     ig       igr      gini
Accuracy       84.22%   85.21%   85.21%   82.37%   80.34%   79.66%   83.84%   80.30%   80.81%
Error rate     15.78%   14.79%   14.79%   17.63%   19.66%   20.34%   16.16%   19.70%   19.19%
Sensitivity    89.53%   87.60%   88.37%   75.89%   77.30%   78.01%   86.73%   84.69%   76.53%
Specificity    78.71%   82.73%   81.93%   88.31%   83.12%   81.17%   81.00%   76.00%   85.00%
Precision      81.34%   84.01%   83.52%   85.60%   80.74%   79.14%   81.73%   77.57%   83.33%

(ig = information gain, igr = gain ratio, gini = Gini index)

Accuracy (recognition rate): Percentage of test set tuples that are correctly classified

Error rate (misclassification rate): 1 – accuracy

Sensitivity (recall): True positive recognition rate

Specificity: True negative recognition rate

Precision (exactness): What % of tuples labeled as positive are actually positive.
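These definitions can be verified against any of the confusion matrices above. A base-R sketch using the 50%-50% gain-ratio matrix (positive class "1"; rows = prediction, columns = reference, matching the caret printout):

```r
# Confusion matrix from the 50%-50% gain-ratio model above
cm <- matrix(c(206, 43, 32, 226), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

tp <- cm["1", "1"]; tn <- cm["0", "0"]   # true positives / true negatives
fp <- cm["1", "0"]; fn <- cm["0", "1"]   # false positives / false negatives

accuracy    <- (tp + tn) / sum(cm)  # 0.8521
sensitivity <- tp / (tp + fn)       # 0.8760
specificity <- tn / (tn + fp)       # 0.8273
precision   <- tp / (tp + fp)       # 0.8401
```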

For the 50%-50% partition: gain ratio and gini index are equal in accuracy at 85.21%, but as mentioned in the best-model section, we choose based on precision (84.01%) and specificity (82.73%), so gain ratio is the best model.

For the 70%-30% partition: information gain is the best model, with accuracy 82.37%.

For the 80%-20% partition: information gain is the best model, with accuracy 83.84%.

Overall, the 50%-50% split was the best partition, using the gain ratio measure, with accuracy 85.21%.

Clustering

Metric K=2 K=5 K=7
Average Silhouette Width 0.36 0.29 0.27
Total Within-Cluster Sum of Squares 8669.059 4303.841 3389.656
BCubed Precision 0.002333333 0.001666667 0.00067
BCubed Recall 0.1166667 0.1333333 0.067

In our overall analysis, we evaluated the performance of the k-means clustering algorithm for three different values of k (2, 5, and 7). The key metrics provide insights into the characteristics of the resulting clusters:

  1. Average Silhouette Width:

    • The average silhouette width measures the separation and cohesion of clusters. As k increases from 2 to 7, there is a gradual decline in silhouette width, indicating that clusters become less distinct. However, k=5 still maintains a reasonable silhouette width, suggesting a balance between separation and cohesion.
  2. Total Within-Cluster Sum of Squares:

    • The total within-cluster sum of squares reflects the compactness of clusters. Notably, k=7 exhibits the lowest sum of squares, implying more compact clusters compared to k=5 and k=2. This aligns with the expectation that a higher k tends to yield more compact clusters.
  3. BCubed Precision and Recall:

    • BCubed precision and recall metrics assess the accuracy and completeness of clustering. Precision and recall values are relatively low across all k values, indicating potential misclassifications. However, k=5 stands out with slightly higher precision and recall compared to k=2 and k=7, suggesting a better balance between accuracy and completeness.

Considering the results obtained from the elbow method, which identifies k=5 as the optimal number of clusters, our analysis aligns with this recommendation. The detailed examination of silhouette width, within-cluster sum of squares, and BCubed metrics reinforces the choice of k=5 as a suitable configuration, striking a balance between cluster distinctiveness and internal cohesion. This comprehensive evaluation supports the decision to proceed with k=5 for a more refined and effective clustering solution.
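
For reference, the silhouette figures above can be reproduced along the following lines. This is a base-R sketch under assumptions: `X` is a stand-in for the numeric attributes that were actually clustered (random data here, for illustration); in the notebook, the `cluster` and `factoextra` packages loaded in the setup chunk give the same numbers more conveniently.

```r
set.seed(123)
X <- scale(matrix(rnorm(200), ncol = 2))  # placeholder for the clustered numeric attributes
km <- kmeans(X, centers = 5, nstart = 25)
km$tot.withinss                           # total within-cluster sum of squares

# Average silhouette width, computed directly from its definition
D <- as.matrix(dist(X))
n <- nrow(X)
sil <- sapply(seq_len(n), function(i) {
  own <- km$cluster[i]
  if (sum(km$cluster == own) == 1) return(0)            # singleton clusters get s = 0
  a <- mean(D[i, km$cluster == own & seq_len(n) != i])  # mean intra-cluster distance
  b <- min(sapply(setdiff(unique(km$cluster), own),
                  function(cl) mean(D[i, km$cluster == cl])))  # nearest other cluster
  (b - a) / max(a, b)
})
mean(sil)                                 # average silhouette width
```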

In the following figures, the visualization of the clusters for each trial is shown:

K=2:

K=5:

K=7:

Classification vs Clustering

Classification offers a more interpretable and actionable solution compared to clustering, as it assigns clear labels to data points. In addition, the availability of class labels enhances the model's ability to generalize and make informed predictions. While clustering struggles with overlapping clusters and a lack of clear evaluation metrics, classification leverages the labeled data to optimize performance and provide meaningful insights. Therefore, for our dataset, the utilization of classification is justified not only by its high accuracy but also by its ability to address the specific characteristics and challenges presented in the data.

---
title: "R Notebook"
output: html_notebook
---

Important packages:

```{r}
# Note: in a notebook, setwd() only applies within this chunk; set the knitr
# root.dir option in the setup chunk to change the working directory globally
setwd("Dataset")
Dataset <- read.csv("data.csv")
Preprocessed_dataset <- read.csv("preprocessed_dataset.csv")

# Install any missing packages, then load every package the project uses
# (the original chunk installed several packages without ever attaching them)
pkgs <- c("ggplot2", "dplyr", "ltm", "cluster", "factoextra", "caret",
          "party", "partykit", "rpart", "rpart.plot", "tidyverse", "RWeka")
for (pkg in pkgs) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
```

# IT326 Project {style="color: gray"}

------------------------------------------------------------------------

## The Goal and problem

Our primary objective in this analysis is to classify whether a student will go to college or not. We aim to discover the most influential variables and their relationships with the decision not to attend college, in order to provide counseling sessions that help such students.

## Data mining Task

We use classification methods to identify the main factors and reasons why students are less likely to pursue higher education, using the class label "will_go_to_college" with the value 'False'. We leverage the provided dataset, with attributes such as school type, school accreditation, gender, interest in college, and residence, and we use clustering to group patterns of students who will or will not go to college.

## The Source

Kaggle.com

## URL :

<https://www.kaggle.com/datasets/saddamazyazy/go-to-college-dataset>

## General information

-   Number of attributes : 11
-   Number of rows (objects) : 1000
-   The class label: The class label in our project is the attribute "will_go_to_college". This attribute is binary, meaning that it can take on two values: True for yes or False for no. The value of this attribute will be the target variable that we are trying to predict during our project.

------------------------------------------------------------------------

+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| Attribute             | Description                                                | Type      | Possible values                                                           |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| type_school           | The type of school the student attends                     | Binary    | Academic / Vocational                                                     |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| school_accreditation  | The quality of the school                                  | Binary    | A / B (A is better than B)                                                |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| gender                | The student's gender                                       | Binary    | Male / Female                                                             |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| interest              | The student's interest in going to college                 | Nominal   | Very interested /Interested / Less Interested / Not Interested /Uncertain |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| residence             | The student's residence                                    | Binary    | Urban / Rural                                                             |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| parent_age            | The age of the student's parents                           | Numeric   | 40 - 65                                                                   |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| parent_salary         | The monthly salary of the student's parents in IDR/Rupiah. | Numeric   | 1000K - 10M                                                               |
|                       |                                                            |           |                                                                           |
|                       | [1Rupiah = 0.00024SAR]                                     |           |                                                                           |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| house_area            | The size of the student's house in meter square            | Numeric   | 20 - 120                                                                  |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| average_grades        | The student's average grades in school.                    | Numeric   | 75 - 98                                                                   |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| parent_was_in_college | Whether the student's parents attended college.            | Binary    | True - False                                                              |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+
| will_go_to_college    | Whether the student plans to go to college.                | Binary    | True - False                                                              |
+-----------------------+------------------------------------------------------------+-----------+---------------------------------------------------------------------------+

## Sample of our data

```{r}
head(Dataset)
```

Here is a sample of 6 rows from our dataset.

## Missing values

```{r}
sum(is.na(Dataset))
```

There are no missing values in our dataset.

## Statistical graphs

Graph 1:

```{r}
df =data.frame(Dataset)
ggplot(data=df, aes(x = interest, fill = will_go_to_college)) +
  geom_bar() +
  scale_x_discrete(limits = c('Not Interested', 'Less Interested', 'Uncertain', 'Interested', 'Very Interested')) +
  labs(title = 'College interest vs College attendance ') +
  scale_fill_manual(values = c("True" = "antiquewhite2", "False" = "antiquewhite3")) +
  theme_minimal()


```

According to the graph, whether students are interested in going to college or not does not determine whether they actually end up attending college. There is a group of individuals who were interested in attending but did not receive acceptance, while others who were not interested were accepted.

Graph 2:

```{r}

filtered_True =filter(Dataset, will_go_to_college == 'True')
filtered_False =subset(Dataset, will_go_to_college =='False')

ggplot() +
  geom_density(data = filtered_True, aes(x = average_grades, fill = "Going to College"), alpha = 0.5) +
  geom_density(data = filtered_False, aes(x = average_grades, fill = "NOT Going to College"), alpha = 0.5) +
  labs(x = "Average Grades", y = "Density") +
  ggtitle("Comparison of Average Grades for Students Going to College and NOT Going to College") +
  scale_fill_manual(values = c("Going to College" = "antiquewhite4", "NOT Going to College" = "antiquewhite1"))

```

The graph shows that the average grades of students who were accepted to college are higher than those of students who did not enter college. This indicates a correlation between going to college and average grades.

Graph 3:

```{r}
Dataset_percentage <- Dataset %>%
  group_by(type_school) %>%
  summarise(percentage = mean(will_go_to_college == "True") * 100)

# Create a percentage chart
ggplot(Dataset_percentage, aes(x = type_school, y = percentage, fill = type_school)) +
  geom_bar(stat = "identity") +
  labs(title = "Percentage of Students Going to College by Type of School",
       x = "Type of School",
       y = "Percentage") +
  scale_fill_manual(values = c("Academic" = "antiquewhite3", "Vocational" = "antiquewhite2")) +
  theme_minimal()

 
```

This graph shows the impact of the type of high school attended by students on their college attendance. Based on the bar chart:

-   Among students from Academic high schools, 313 are going to college, and 296 are not.

-   Among students from Vocational high schools, 187 are going to college, and 204 are not.

This information tells us that a higher proportion of students from academic high schools go to college compared to those from vocational schools, which suggests that the type of school attended influences college attendance.

Graph 4:

```{r}
ggplot(Dataset, aes(x = average_grades)) +
  geom_histogram(binwidth = 5, fill = "antiquewhite2", color = "antiquewhite4") +
  labs(title = "Distribution of students' grades",
       x = "Students' average grades",
       y = "Frequency") +
  theme_minimal()
```

This histogram shows us that the majority of students in the dataset are performing well, since their grades span between 75 and 98. This analysis will help us determine whether students' academic performance level is a contributing factor to their college attendance or not.

## Statistical Measures

1.  The student's academic performance analysis:

```{r}
summary(Dataset$average_grades)

```

The student grades in our dataset range from 75.00 to 98.00, with a median of 85.58 and an average of 86.10. This suggests that most students are doing well, as none of them have average grades below 50. However, it's interesting to note that some students have much higher or lower grades than the average, mainly due to the wide range of grades.

2.  The socioeconomic status of students' families analysis:

```{r}
summary(Dataset$parent_salary)
```

In the dataset, the students' parents have salaries ranging from 1,000,000 to 10,000,000 IDR/Rupiah. The median salary is 5,440,000 IDR/Rupiah, and the average is 5,381,570 IDR/Rupiah. This tells us that many parents in our dataset earn less than the average salary in Indonesia, which is 146,000,000 IDR. This suggests that quite a few students in our dataset come from families with limited finances, and this financial situation could certainly impact their ability to get through college.

```{r}
summary(Dataset$house_area)
```

Additionally, we can utilize the house_area attribute to gain a deeper understanding of the socioeconomic status of students' families: students with houses significantly larger than the mean might indicate a higher socioeconomic status, while those with houses considerably smaller than the mean might reflect a comparatively lower one. Based on the output shown, the house areas range from 20.00 to 120.00 ㎡. The median house area of 75.50 ㎡ indicates that families around this value likely have moderate socioeconomic status, with houses that are neither very small nor very large.

3.  Understanding Parent Age Range and Variation in its Values

```{r}
summary(Dataset$parent_age)
SD=sd(Dataset$parent_age)
MeanAge=mean(Dataset$parent_age)
cat("coefficient of variation:",SD/MeanAge*100,"%")

```

This summary provides the range of the age attribute, [40, 65], which indicates that all parents are middle-aged; at this age, parents tend to be more concerned about their children. The coefficient of variation is 6.7%, which indicates low variation, so the values of parent_age are relatively close to the mean. Overall, 25% of parents are aged 50 or below, 75% are aged 54 or below, and the median is 52.

## Preprocessing data

We aim to increase the quality of the data and transform it to make it suitable for analysis.

## Outliers analysis

```{r}
###parent age outliers
quartiles <- quantile(Dataset$parent_age, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_age)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(Dataset, Dataset$parent_age > Lower & Dataset$parent_age < Upper)
dim(data_no_outlier)

###parent salary outliers
quartiles <- quantile(Dataset$parent_salary, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_salary)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$parent_salary> Lower & data_no_outlier$parent_salary < Upper)
dim(data_no_outlier)


###averge grades outliers
quartiles <- quantile(Dataset$average_grades, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$average_grades)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$average_grades> Lower & data_no_outlier$average_grades < Upper)
dim(data_no_outlier)


###house area outliers
quartiles <- quantile(Dataset$house_area, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$house_area)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$house_area> Lower & data_no_outlier$house_area < Upper)

Founded_Outliers=data.frame(anti_join(Dataset,data_no_outlier))
print(Founded_Outliers)
```

After conducting data analysis and identifying outliers, our inspection reveals that the detected outliers represent inherent variation within the population. For parent_age, outliers are flagged for values below 44 and above 65; however, ages from 40 to 65 fall within the expected range of our dataset, so they are not true outliers. For parent_salary, we found two outliers: one below 1,326,250 IDR ≈ 85 USD and another above 9,416,250 IDR ≈ 606 USD, while the minimum and maximum values are 1,000,000 IDR ≈ 64 USD and 10,000,000 IDR ≈ 644 USD, respectively. In the case of grades, twelve outliers were identified, below 76 or above 97; since the data falls within the acceptable range of 0 to 100, these outliers should be retained, as they are still normal and within the usual grade range. Finally, for house_area, we found eleven outliers below 34.4 ㎡ or above 115 ㎡, with a minimum of 20 ㎡ and a maximum of 120 ㎡; these values are still typical for the population.
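
The same 1.5×IQR rule is what boxplots use to draw their whiskers, so the flagged values can also be confirmed visually. A minimal self-contained sketch (a synthetic vector here — in the notebook, the `Dataset` columns would be passed instead):

```r
# Points beyond the whiskers are exactly the values flagged by the 1.5*IQR rule
x <- c(rep(50, 20), 100)               # synthetic data with one obvious outlier
out <- boxplot.stats(x)$out            # values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
boxplot(x, main = "1.5*IQR outliers")  # the outlier is drawn as a separate point
out
```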

## Normalization

```{r}
normalize <- function(x) {return((x-min(x))/ (max(x)-min(x)))}
datasetWithoutNormalization<-Dataset
Dataset$parent_salary<-normalize(datasetWithoutNormalization$parent_salary)
Dataset$house_area<-normalize(datasetWithoutNormalization$house_area)
print(Dataset)
```

We applied normalization to the 'parent_salary' and 'house_area' attributes, scaling their values to a range between 0 and 1. This normalization greatly facilitates data handling and analysis by ensuring that these attributes are on a consistent scale, which will improve the reliability of our data analysis and enable better conclusions to be drawn from the dataset. Normalization is a crucial step in preparing the data for modeling, as it prevents attributes with larger numerical ranges from dominating the analysis and ensures fair treatment for all features.

## Discretization

```{r}


# Work from a numeric copy of the grades: assigning character labels directly
# into the numeric column coerces it to character after the first assignment,
# silently turning the remaining numeric comparisons into string comparisons
grades <- Dataset$average_grades
Dataset$average_grades[grades >= 95]               <- '+A'
Dataset$average_grades[grades < 95 & grades >= 90] <- 'A'
Dataset$average_grades[grades < 90 & grades >= 85] <- '+B'
Dataset$average_grades[grades < 85 & grades >= 80] <- 'B'
Dataset$average_grades[grades < 80 & grades >= 75] <- '+C'
Dataset$average_grades[grades < 75 & grades >= 70] <- 'C'
Dataset$average_grades[grades < 70 & grades >= 65] <- '+D'
Dataset$average_grades[grades < 65 & grades >= 60] <- 'D'
Dataset$average_grades[grades < 60 & grades >= 0]  <- 'F'
Dataset$average_grades <- as.character(Dataset$average_grades)
print(Dataset)
```

We transformed the parent_age attribute into intervals by dividing its values into two equal-width interval labels, (40,50] and (50,60]. With discretization, the values become simpler to classify and to use in other methods later in our model.
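
The parent_age binning itself is not shown as a chunk; a minimal sketch with `cut()` could look like the following. The break points are an assumption taken from the interval labels in the text, and note that the summary above reports ages up to 65, which these two bins do not cover.

```r
# Sketch: equal-width binning of parent_age into (40,50] and (50,60]
# (hypothetical ages; include.lowest keeps the value 40 in the first bin)
ages <- c(40, 45, 52, 58, 60)
cut(ages, breaks = c(40, 50, 60),
    labels = c("(40,50]", "(50,60]"),
    include.lowest = TRUE)
```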

To better utilize and interpret the grades attribute for each student, we converted the numeric grades into letter grades (A+, A, B+, B, C+, C, D+, D, F). This transformation focuses on the general letter-grade representation rather than the precise numerical values.

## Encoding

```{r}
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="TRUE"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="True"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="FALSE"]<-0
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="False"]<-0

Dataset$will_go_to_college[Dataset$will_go_to_college=="TRUE"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="True"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="FALSE"]<-1
Dataset$will_go_to_college[Dataset$will_go_to_college=="False"]<-1

Dataset$gender[Dataset$gender=="Female"]<-1
Dataset$gender[Dataset$gender=="Male"]<-0

Dataset$school_accreditation[Dataset$school_accreditation=="A"]<-1
Dataset$school_accreditation[Dataset$school_accreditation=="B"]<-0

Dataset$interest[Dataset$interest=="Very Interested"]<-4
Dataset$interest[Dataset$interest=="Interested"]<-3
Dataset$interest[Dataset$interest=="Less Interested"]<-2
Dataset$interest[Dataset$interest=="Not Interested"]<-1
Dataset$interest[Dataset$interest=="Uncertain"]<-0


Dataset$type_school[Dataset$type_school=="Academic"]<-1 
Dataset$type_school[Dataset$type_school=="Vocational"]<-0

Dataset$residence[Dataset$residence=="Urban"]<-1
Dataset$residence[Dataset$residence=="Rural"]<-0
print(Dataset)
```

Since encoding is an important step in data preprocessing that enables the use of categorical data in various data analysis and machine learning tasks, we encoded attributes such as 'parent_was_in_college' from (True, False) to (1, 0) and 'will_go_to_college' from (True, False) to (0, 1); this inverted encoding of the class label reflects our aim of predicting the factors behind not attending. Additionally, we encoded the 'gender' attribute from (Female, Male) to (1, 0), 'school_accreditation' from (A, B) to (1, 0), 'type_school' from (Academic, Vocational) to (1, 0), 'residence' from (Urban, Rural) to (1, 0), and 'interest' from (Very Interested, Interested, Less Interested, Not Interested, Uncertain) to (4, 3, 2, 1, 0), respectively. Encoding simplifies the data, reduces complexity, and enhances its suitability for modeling purposes.

## Correlation analysis Chi square test for nominal attribute:

```{r}

#1
C=chisq.test(Dataset$type_school , Dataset$will_go_to_college)
print(C)
#2

C=chisq.test(Dataset$school_accreditation , Dataset$will_go_to_college)
print(C)
#3
C=chisq.test(Dataset$gender , Dataset$will_go_to_college)
print(C)
#4
C=chisq.test(Dataset$interest , Dataset$will_go_to_college)
print(C)

#5
C=chisq.test(Dataset$residence , Dataset$will_go_to_college)
print(C)

#6
C=chisq.test(Dataset$average_grades , Dataset$will_go_to_college)
print(C)

#7
C=chisq.test(Dataset$parent_was_in_college , Dataset$will_go_to_college)
print(C)
```

All the attributes have p-values below the 0.05 significance level, which indicates an association with the class label; therefore, we reject the null hypothesis of independence.

We noticed that for 'interest' and 'average_grades' the analysis shows very large X-squared statistics with correspondingly small p-values, indicating a significant association of these two attributes with the student's decision to go to college or not.
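
The per-attribute tests above can also be collected into one sorted p-value table, which makes the comparison explicit. A sketch of the pattern, run here on a tiny synthetic frame `d` (an assumption — in the notebook, `Dataset` and the full attribute list would be used instead):

```r
# Synthetic stand-in for Dataset: two categorical attributes and a class label
set.seed(1)
d <- data.frame(type_school        = sample(c(0, 1), 200, replace = TRUE),
                interest           = sample(0:4,     200, replace = TRUE),
                will_go_to_college = sample(c(0, 1), 200, replace = TRUE))

attrs <- c("type_school", "interest")
pvals <- sapply(attrs, function(a)
  chisq.test(table(d[[a]], d$will_go_to_college))$p.value)
sort(pvals)  # smallest p-value = strongest evidence of association
```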

## Correlation coefficient analysis for numeric attribute:

```{r}

biserial.cor(Dataset$parent_salary,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
biserial.cor(Dataset$house_area,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
biserial.cor(Dataset$parent_age,Dataset$will_go_to_college,c("all.obs", "complete.obs"), level = 1)
 
 

```

The analysis shows a moderate correlation coefficient for parent_salary and house_area with the class label, which indicates that they are relevant factors: the higher the parent salary and the larger the house area, the higher the probability that a student enrolls in college.

On the other hand, the correlation coefficient for parent_age is very small, which indicates that parent age has little impact on the probability that a student enrolls in college.

## Feature selection:

Ultimately, based on the correlation analysis we conducted on the relationships of the dataset attributes with the class label, and on our understanding of the data, the context of each attribute, and its potential relevance to the class label, we decided not to delete any attribute.

## Classification:

The attribute selection measure is a heuristic technique employed to select the best criterion for dividing a collection of data tuples into partitions that are as pure as possible. Several attribute selection measures are commonly utilized, such as Information Gain (ID3), Gain Ratio (C4.5), and the Gini Index (CART). We will evaluate each of these measures on different partitions of the data, splitting it into ratios of 50%-50%, 70%-30%, and 80%-20%.
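
The impurity functions behind these measures can be written down directly; a minimal sketch (entropy underlies information gain and gain ratio, while the Gini index underlies CART):

```r
# Impurity of a vector of class labels
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log2(p))
}
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

y <- c(1, 1, 1, 0, 0, 0)  # perfectly balanced binary labels
entropy(y)                # 1, the maximum for a binary label
gini(y)                   # 0.5, the maximum for a binary label
```

A split's score is then the weighted drop in impurity between the parent node and its children; the attribute with the largest drop is chosen.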

### factor the data

```{r}


data<-Preprocessed_dataset
data$will_go_to_college <- as.factor(data$will_go_to_college)
data$residence <- as.factor(data$residence)
data$gender <- as.factor(data$gender)
data$parent_was_in_college <- as.factor(data$parent_was_in_college)
data$interest <- as.factor(data$interest)
data$type_school <- as.factor(data$type_school)
data$school_accreditation <- as.factor(data$school_accreditation)
data$average_grades <- as.factor(data$average_grades)

```

### balanced or imbalanced

```{r}

library(tidyverse) 
library(caret)
# Plot the class distribution from a temporary numeric copy, so the factor
# class label is preserved for the classification models below
class_numeric <- as.numeric(data$will_go_to_college)
hist(class_numeric, col = "coral")
prop.table(table(data$will_go_to_college))
```

We want to confirm that the distribution of the two class labels is not too different, because imbalanced datasets can lead to misleading accuracy.

Fortunately, our data is balanced.

| **70% training, 30% testing** |
|-------------------------------|

```{r}
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.70 , 0.30))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
dim(test_data)
```

We partition the data into 70% training and 30% testing. This results in 705 objects in the training set and 295 objects in the testing set.

#### **information gain**

```{r}

library(party)
myFormula<- will_go_to_college~ + type_school+school_accreditation+gender+interest+residence+parent_age+parent_salary+house_area+average_grades+parent_was_in_college

dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college)
plot(dataset_ctree,type="simple")
testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )
library(caret)
co_result <- confusionMatrix(result,positive="1")
print(co_result)
```

![](images/WhatsApp%20Image%202023-11-30%20at%2019.36.29_0e85e418.jpg)

The table below shows the evaluation of the model on the testing data:

| Evaluation method | Value  |
|-------------------|--------|
| Accuracy          | 82.37% |
| Error rate        | 17.63% |
| Sensitivity       | 75.89% |
| Specificity       | 88.31% |
| Precision         | 85.60% |
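
All of the reported figures derive from the 2x2 confusion matrix in the usual way; a small sketch (the TP/FP/FN/TN counts below are hypothetical, chosen only so the outputs line up with the table above):

```r
# Deriving the evaluation metrics from 2x2 confusion-matrix counts
metrics <- function(TP, FP, FN, TN) {
  n <- TP + FP + FN + TN
  c(accuracy    = (TP + TN) / n,
    error_rate  = (FP + FN) / n,
    sensitivity = TP / (TP + FN),  # recall of the positive class
    specificity = TN / (TN + FP),
    precision   = TP / (TP + FP))
}
round(metrics(TP = 107, FP = 18, FN = 34, TN = 136), 4)
```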

In this decision tree model, the average grade with the highest information gain is selected as the root attribute. This means that the average grade is considered the most informative feature for classifying the data.

To enhance the model's predictive capability, additional splitting criteria are incorporated based on their information gain values. These criteria include type_school, interest, residence, parent_salary, and house_area. By considering these attributes, the model aims to capture more relevant information from the data and improve its classification accuracy.

The model then utilizes the house area and average grades as splitting criteria to separate the data into different branches of the tree at multiple levels.

However, it's worth noting that the decision tree in this case is relatively shallow, which implies it may not capture intricate patterns or relationships present in the data. For example, gender, parent_was_in_college, parent_age, and school_accreditation are not included in the tree, due to the limited availability of data or their lower information gain values.

#### **Gain ratio**

```{r}
library(RWeka)
C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
plot(C45Fit,type = "simple")
testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
```

![](images/WhatsApp%20Image%202023-11-30%20at%2019.53.30_4eacbc86.jpg)

The table below shows the evaluation of the model on the testing data:

| Evaluation method | Value  |
|-------------------|--------|
| Accuracy          | 80.34% |
| Error rate        | 19.66% |
| Sensitivity       | 77.30% |
| Specificity       | 83.12% |
| Precision         | 80.74% |

In contrast to the previous decision tree, the current tree is more complex and includes all the attributes except for gender. This means that the decision tree takes into account a wider range of factors to make its classifications. The attribute with the highest Gain ratio, which measures the effectiveness of a split, is chosen as the root attribute. In this case, Parent salary is determined to have the highest Gain ratio and is selected as the root attribute.

The model then repeatedly uses average grades as the splitting criterion at different levels of the tree. This indicates that the average grades are considered crucial for classifying the data and further subdividing the branches.

#### **Gini index**

```{r}
library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)


rpart.plot(cart_fit)

testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)

```

The table below shows the evaluation of the model on the testing data:

| Evaluation method | Value  |
|-------------------|--------|
| Accuracy          | 79.66% |
| Error rate        | 20.34% |
| Sensitivity       | 78.01% |
| Specificity       | 81.17% |
| Precision         | 79.14% |

In this partition of the decision tree, parent_salary was once again chosen as the root attribute because splitting on it yields the greatest reduction in Gini impurity.

Gender and residence, on the other hand, were not included in this partition of the tree. Without sufficient data or a significant impurity reduction associated with these attributes, the model does not have enough evidence to utilize them for accurate classification.

It's also worth noting that the accuracy of this specific model was relatively lower, at 79.66%.

| 80% training, 20% testing |
|---------------------------|

```{r}
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.80 , 0.20))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
dim(test_data)
```

We partition the data into 80% training and 20% testing. This results in 802 objects in the training set and 198 objects in the testing set.

#### **information gain**

```{r}
library(party)
myFormula<- will_go_to_college~ + type_school+school_accreditation+gender+interest+residence+parent_age+parent_salary+house_area+average_grades+parent_was_in_college

dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college)
plot(dataset_ctree,type="simple")
testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )
library(caret)
co_result <- confusionMatrix(result,positive="1")
print(co_result)
```

![](images/WhatsApp%20Image%202023-11-30%20at%2020.15.36_e95c25e1.jpg)

The table below shows the evaluation of the model on the testing data:

| Evaluation method | Value   |
|-------------------|---------|
| Accuracy          | 83.84 % |
| Error rate        | 16.16%  |
| Sensitivity       | 86.73%  |
| Specificity       | 81.00%  |
| Precision         | 81.73%  |

In this partition of the decision tree, average grades, the attribute with the highest information gain, is selected as the root. This selection reflects the fact that average grades has the lowest conditional entropy, so splitting on it yields a more effective split and improves classification accuracy. Note also that gender, parent_was_in_college, and parent_age were not included in the tree.

On a positive note, the accuracy in this partition has increased by 1.47% compared to the previous partition. This improvement in accuracy can be attributed to the fact that the training data has been increased by 10%. By having more data for training, the model has a larger and more diverse set of examples to learn from, allowing it to make more accurate predictions.

#### **Gain ratio**

```{r}
library(RWeka)


C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))

plot(C45Fit,type = "simple")
testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)
```

![](images/WhatsApp%20Image%202023-11-30%20at%2020.17.25_c02a1b81.jpg)

The table below shows the evaluation of the model on the testing data:

| Evaluation metric | Value  |
|-------------------|--------|
| Accuracy          | 80.03% |
| Error rate        | 19.97% |
| Sensitivity       | 84.69% |
| Specificity       | 76.00% |
| Precision         | 77.57% |

In this model, which uses the gain ratio, the decision tree is once again more complex than the other models. Parent salary is chosen as the root attribute because splitting on it yields high-purity partitions and hence a high gain ratio. However, it is important to note that the gender attribute was not included in the decision tree.

#### **Gini index**

```{r}
library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)


rpart.plot(cart_fit)

testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)

```

The table below shows the evaluation of the model on the testing data:

| Evaluation metric | Value  |
|-------------------|--------|
| Accuracy          | 80.81% |
| Error rate        | 19.19% |
| Sensitivity       | 76.53% |
| Specificity       | 85.00% |
| Precision         | 83.33% |

For the first time, house area was chosen as the root attribute: splitting on it produced the purest partitions, i.e., the lowest Gini index.

The model relies on the parent_salary attribute to split the tree multiple times, indicating its importance in determining the classification. This suggests that the parent's salary has a significant impact on whether a student will go to college or not, and it is repeatedly used as a criterion for further branching in the tree.

It's worth noting that 20% of the dataset follows the rule:

IF house_area \< 0.54 and parent_salary \< 0.42, THEN will go to college = 1.

This rule implies that when the house area is below a certain threshold and the parent's salary is below another threshold, there is a higher likelihood of the student going to college.

Moreover, 28% of the dataset follows the rule:

IF house_area \>= 0.54 and parent_salary \>= 0.37 and type_school = 1, THEN will go to college = 1. This rule suggests that when the house area is above a certain threshold, the parent's salary is above another threshold, and the type of school is 1, there is a higher probability of the student going to college.

Overall, these two rules cover a significant portion of the dataset, 48% of it. This means a large share of the data can be classified by these rules alone, providing valuable insight into the factors that influence a student's likelihood of attending college. We mention only these two rules because of the high percentage of the data they cover.

| 50% training, 50% testing |
|---------------------------|

```{r}
set.seed(123)
ind=sample(2, nrow(data), replace=TRUE, prob=c(0.50 , 0.50))
train_data=data[ind==1,]
test_data=data[ind==2,]
dim(train_data)
dim(test_data)

```

We partition the data into 50% training and 50% testing. This results in 493 objects in the training set and 507 objects in the testing set.

#### **Information gain**

```{r}
library(party)
myFormula <- will_go_to_college ~ type_school + school_accreditation + gender + interest + residence + parent_age + parent_salary + house_area + average_grades + parent_was_in_college

dataset_ctree<-ctree(myFormula, data=train_data)
table(predict(dataset_ctree), train_data$will_go_to_college) 

plot(dataset_ctree,type="simple")
testPred <- predict(dataset_ctree, newdata = test_data)
result<-table(testPred, test_data$will_go_to_college )

library(caret)
co_result <- confusionMatrix(result, positive = "1")
print(co_result)

```

![](images/WhatsApp%20Image%202023-11-30%20at%2020.20.03_ba0b3fcb.jpg)

The table below shows the evaluation of the model on the testing data:

| Evaluation metric | Value  |
|-------------------|--------|
| Accuracy          | 84.22% |
| Error rate        | 15.78% |
| Sensitivity       | 89.53% |
| Specificity       | 78.71% |
| Precision         | 81.34% |

In this decision tree model, the root attribute is determined by selecting the average grade with the highest information gain. This indicates that the average grade is considered the most informative feature for classifying the data, as it provides significant insights into the classification process.

To improve the model's predictive capability, additional splitting criteria are incorporated based on their information gain values. These criteria include type_school, interest, residence, parent_salary, house_area, and whether the parent was in college. By considering these attributes, the model aims to capture more relevant information and enhance its ability to classify the data accurately.

In the tree structure, the model utilizes both the house area and average grades as splitting criteria in multiple levels. This means that the data is divided into different branches of the tree based on these attributes, allowing for more refined classification.

Gender, parent age, and school accreditation are not included in the tree, due either to limited data or to their lower information gain values, indicating that they have less impact on the classification process according to the model's evaluation.

#### **Gain ratio**

```{r}
library(RWeka)
C45Fit <- J48( will_go_to_college~.,data=train_data)
table(train_data$will_go_to_college, predict(C45Fit))
plot(C45Fit,type = "simple")
testPred <- predict(C45Fit, newdata = test_data)
results <- confusionMatrix(testPred, test_data$will_go_to_college, positive = "1")
print(results)

```

![](images/WhatsApp%20Image%202023-11-30%20at%2020.20.33_a381b10a.jpg)

The table below shows the evaluation of the model on the testing data:

| Evaluation metric | Value  |
|-------------------|--------|
| Accuracy          | 85.21% |
| Error rate        | 14.79% |
| Sensitivity       | 87.60% |
| Specificity       | 82.73% |
| Precision         | 84.01% |

In contrast to the previous decision tree, the current tree is more complex and includes all the attributes except for parent age. Notably, for the first time, gender is included as an attribute in the tree. This indicates that the decision tree now takes into account a wider range of factors, including gender, to make its classifications. By incorporating these additional attributes, the model aims to capture more nuanced patterns and relationships in the data.

The attribute with the highest Gain ratio is chosen as the root attribute. In this particular case, Parent salary is determined to have the highest Gain ratio and is selected as the root attribute. This suggests that Parent salary provides the most significant reduction in uncertainty and is considered crucial for classifying the data accurately.

#### **Gini index**

```{r}
library(rpart)
library(rpart.plot)
cart_fit=rpart(will_go_to_college~., data=train_data, method="class",cp=0.008)

rpart.plot(cart_fit)

testPred<- predict(cart_fit,newdata=test_data,type="class")
results<- confusionMatrix(testPred,test_data$will_go_to_college,positive="1")
print(results)
```

The table below shows the evaluation of the model on the testing data:

| Evaluation metric | Value  |
|-------------------|--------|
| Accuracy          | 85.21% |
| Error rate        | 14.79% |
| Sensitivity       | 88.37% |
| Specificity       | 81.93% |
| Precision         | 83.52% |

The attribute chosen as the root of this decision tree is average grades, because splitting on it produces the purest partitions, i.e., the lowest Gini index. The Gini index measures the impurity of a node in a decision tree, so a low value indicates that the attribute is effective for classification.

27% of the dataset follows the rule:

IF average_grades=+A,+B,A and parent_salary\>=0.37 and house_area\>=0.41 and type_school=1, THEN will_go_to_college=0.

Furthermore, 17% of the dataset follows the rule:

IF average_grades=B,+C and house_area\<=0.41, THEN will_go_to_college=1.

These rules highlight the complexity and variability in determining whether a student will go to college. The first rule shows that even a student with stable financial circumstances and good grades is not guaranteed to enroll in college; in contrast, the second rule suggests that other factors play a role in the decision-making process.

We mention only these two rules because of the high percentage of the data they cover.

## Final analysis

#### Information gain

Information Gain is the attribute selection measure used in the ID3 algorithm for decision tree learning. It quantifies the amount of information obtained about the class variable when a given attribute is used to split the data. It is calculated by subtracting the conditional entropy (the information still needed to classify a tuple after using attribute A to split D into v partitions) from the entropy of D (the expected information needed to classify a tuple in D). A higher Information Gain indicates that the attribute is more informative and should be chosen as the splitting criterion.
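Using the standard textbook notation (with $p_i$ the probability that a tuple in $D$ belongs to class $C_i$, and $D_j$ the $j$-th partition induced by attribute $A$), the quantities described above are:

$$
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
$$

$$
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
$$

$$
Gain(A) = Info(D) - Info_A(D)
$$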

for our trees,

Average grades was chosen as the root attribute for all three trees because it had the highest information gain. This attribute reflects the least randomness, or "impurity", in the resulting partitions, which guarantees that a simple (though not necessarily the simplest) tree is found.

|             | 50%-50% | 70%-30% | 80%-20% |
|-------------|---------|---------|---------|
| Accuracy    | 84.22%  | 82.37%  | 83.84%  |
| Error rate  | 15.78%  | 17.63%  | 16.16%  |
| Sensitivity | 89.53%  | 75.89%  | 86.73%  |
| Specificity | 78.71%  | 88.31%  | 81.00%  |
| Precision   | 81.34%  | 85.60%  | 81.73%  |

The evaluation table shows that 50%-50% is the best partition for information gain, based on accuracy (the percentage of tuples correctly classified), 84.22%, and sensitivity (the true positive recognition rate), 89.53%, although it has the lowest specificity (the true negative recognition rate), 78.71%. Overall, it is the best model for information gain, with the lowest error rate, 15.78%.

#### Gain ratio

Gain Ratio is a normalization of Information Gain used in the C4.5 algorithm, an extension of ID3. While Information Gain tends to favor attributes with many distinct values, Gain Ratio corrects for this bias by dividing the Information Gain by the split information, SplitInfo_A(D).
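Concretely, with the same notation as for information gain:

$$
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)
$$

$$
GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}
$$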

for our trees,

Parent salary was chosen as the root attribute for all three trees due to its high gain ratio. The resulting splits are unbalanced, since the gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.

|             | 50%-50% | 70%-30% | 80%-20% |
|-------------|---------|---------|---------|
| Accuracy    | 85.21%  | 80.34%  | 80.03%  |
| Error rate  | 14.79%  | 19.66%  | 19.97%  |
| Sensitivity | 87.60%  | 77.30%  | 84.69%  |
| Specificity | 82.73%  | 83.12%  | 76.00%  |
| Precision   | 84.01%  | 80.74%  | 77.57%  |

The evaluation table shows that 50%-50% is the best partition for gain ratio, based on accuracy (the percentage of tuples correctly classified), 85.21%, sensitivity (the true positive recognition rate), 87.60%, and precision (the percentage of tuples labeled positive that are actually positive), 84.01%, although its specificity (the true negative recognition rate) is lower, at 82.73%. Overall, it is the best model for gain ratio, with the lowest error rate, 14.79%.

#### Gini index

Gini Index is the criterion used in the CART (Classification and Regression Trees) algorithm for attribute selection. It measures the impurity of a set of tuples, with lower values of Gini_A(D) indicating higher purity. The Gini Index of a set is calculated as one minus the sum of the squared probabilities of each class label.
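In symbols, with $p_i$ the probability that a tuple in $D$ belongs to class $C_i$, and for a binary split of $D$ into $D_1$ and $D_2$ on attribute $A$ (CART considers binary splits):

$$
Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
$$

$$
Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
$$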

for our trees,

The Gini index trees had a different root attribute for each partition (parent salary, house area, and average grades).

|             | 50%-50% | 70%-30% | 80%-20% |
|-------------|---------|---------|---------|
| Accuracy    | 85.21%  | 79.66%  | 80.81%  |
| Error rate  | 14.79%  | 20.34%  | 19.19%  |
| Sensitivity | 88.37%  | 78.01%  | 76.53%  |
| Specificity | 81.93%  | 81.17%  | 85.00%  |
| Precision   | 83.52%  | 79.14%  | 83.33%  |

The evaluation table shows that 50%-50% is the best partition for the Gini index, based on accuracy (the percentage of tuples correctly classified), 85.21%, sensitivity (the true positive recognition rate), 88.37%, and precision (the percentage of tuples labeled positive that are actually positive), 83.52%. Overall, it is the best model for the Gini index, with the lowest error rate, 14.79%.

**noticeable notes**

It is important to note that gender was not included in 7 of the 9 trees.

Parent's salary was the root of 4 of the 9 trees.

Average grades was the root of 4 of the 9 trees.

House area was the root of only 1 of the 9 trees.

In summary, the analysis of the decision trees reveals that financial circumstances, as indicated by parent's salary, and academic performance, as indicated by average grades, are the crucial factors for predicting college enrollment. In this particular dataset, gender does not appear to be important in the decision-making process.

#### Best model

The 50% training, 50% testing split was the best partition for all attribute selection measures, with an accuracy of 85.21% for both the Gini index and the gain ratio.

Since both attribute selection measures yield the same accuracy and error rate, we need to compare the remaining evaluation metrics:

|             | gain ratio | gini index |
|-------------|------------|------------|
| Accuracy    | 85.21%     | 85.21%     |
| Error rate  | 14.79%     | 14.79%     |
| Sensitivity | 87.60%     | 88.37%     |
| Specificity | 82.73%     | 81.93%     |
| Precision   | 84.01%     | 83.52%     |

Comparing the sensitivity, specificity, and precision:

\- Sensitivity: Gini Index has a higher sensitivity value (88.37%) compared to Gain Ratio (87.60%). Higher sensitivity indicates a better ability to correctly identify positive instances so **gini index is better regarding Sensitivity.**

\- Specificity: Gini Index has a lower specificity value (81.93%) compared to Gain Ratio (82.73%). Higher specificity indicates a better ability to correctly identify negative instances.

\- Precision: Gain Ratio has a higher precision value (84.01%) compared to Gini Index (83.52%). Higher precision indicates that a larger proportion of the tuples labeled positive are actually positive.

**So gain ratio is better regarding Specificity and Precision.**

Based on these comparisons, it is difficult to determine definitively which attribute selection measure is best; the choice between Gini Index and Gain Ratio may depend on the specific requirements and priorities of the problem. Since our classification goal is to identify students who will not go to college, we focus on true negatives more than true positives, so we choose the model based on Specificity and Precision: the Gain ratio model.

our final tree is:

![](images/WhatsApp Image 2023-11-30 at 20.20.33_a381b10a-02.jpg)

## Clustering Analysis

In this analysis, we apply K-means clustering to the dataset using different values of K. K-means clustering is an unsupervised learning algorithm that partitions the data into K clusters based on similarity. We will explore three different values of K and evaluate the clustering results using various metrics.

### Removing the class label and preparing the dataset for Clustering

```{r}
 
original_data <- Preprocessed_dataset

# Remove any non-numeric attributes
numeric_data <- original_data[, sapply(original_data, is.numeric)]

# Remove the class label 'will_go_to_college'
numeric_data <- numeric_data[, !(names(numeric_data) == 'will_go_to_college')]
```

Now, the `numeric_data` dataset contains only numeric attributes without the class label, making it ready for the clustering process.

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

## K=2

In our exploration of k-means clustering, we start by examining the dataset with K=2, an initial attempt to partition the data into two distinct clusters. This initial analysis serves as a foundational step in understanding the inherent structure and patterns of the data.

```{r}
# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 2 clusters
kmeans.result <- kmeans(numeric_data, 2)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)



```

#### The Silhouette coefficient

```{r}
library(cluster)

# compute the silhouette width of each observation
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))

# plot the silhouette widths, with the average width per cluster
fviz_silhouette(avg_sil)

```

#### The total within-cluster sum of squares

```{r}
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")

true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  # illustrative labels; a full evaluation would use one label per row of numeric_data

cluster_assignments <- kmeans.result$cluster
 
```

#### BCubed recall and precision

```{r}
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")


```

### **Cluster Analysis for K=2:**

-   **Silhouette Width:** 0.36

    -   Indicates moderately separated, distinct clusters.

-   **Within-Cluster Sum of Squares:** 8669.059

    -   Moderate compactness; clusters are not too tight.

-   **BCubed Precision:** 0.0023

    -   Low precision suggests potential misassignments.

-   **BCubed Recall:** 0.1167

    -   Low recall indicates incomplete representation of true clusters.

So, the clusters are moderately compact with a reasonable silhouette width, but low precision and recall suggest potential misclassification issues.
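For reference, the standard textbook definitions of these metrics (which the code above approximates) are as follows. The silhouette coefficient of an observation $i$ is built from $a(i)$, its average distance to the other members of its own cluster, and $b(i)$, its lowest average distance to any other cluster; BCubed precision and recall are per-item averages over the $n$ observations, where $C(\cdot)$ denotes the cluster assignment and $L(\cdot)$ the true label:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

$$
\text{Precision}_{BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\bigl|\{\, j : C(j) = C(i) \text{ and } L(j) = L(i) \,\}\bigr|}{\bigl|\{\, j : C(j) = C(i) \,\}\bigr|}
\qquad
\text{Recall}_{BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\bigl|\{\, j : C(j) = C(i) \text{ and } L(j) = L(i) \,\}\bigr|}{\bigl|\{\, j : L(j) = L(i) \,\}\bigr|}
$$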

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

## K=5

Moving to K=5, our exploration in k-means clustering advances to a configuration with five clusters. This adjustment allows for a more detailed partitioning of the dataset, potentially revealing finer grained patterns and distinctions among the data points. The analysis at K=5 will offer insights into the nature of these clusters, assessing their characteristics, compactness, and separation, as we strive to optimize the clustering configuration for the specific nuances of our dataset.

```{r}
# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 5 clusters
kmeans.result <- kmeans(numeric_data, 5)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)



```

#### The Silhouette coefficient

```{r}
library(cluster)

# compute the silhouette width of each observation
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))

# plot the silhouette widths, with the average width per cluster
fviz_silhouette(avg_sil)

```

#### The total within-cluster sum of squares

```{r}
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")

true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  # illustrative labels; a full evaluation would use one label per row of numeric_data

cluster_assignments <- kmeans.result$cluster
 
```

#### BCubed recall and precision

```{r}
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")


```

### **Cluster Analysis for K=5:**

-   **Silhouette Width:** 0.29

    -   Indicates less well-separated clusters compared to k=2

-   **Within-Cluster Sum of Squares:** 4303.841

    -   Lower within-cluster sum suggests more compact clusters than k=2

-   **BCubed Precision:** 0.00166

    -   Low precision implies potential misclassifications.

-   **BCubed Recall:** 0.1333

    -   Slightly higher recall compared to k=2, but still low.

**Final Analysis:** Clusters are more compact than k=2 but less well-separated. Precision and recall remain low, indicating potential misclassifications.

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

## K=7

Proceeding to k=7, our investigation in k-means clustering expands further as we explore a configuration with seven clusters. This adjustment aims to capture even more nuanced patterns and variations within the dataset. Analyzing the characteristics of the seven clusters will provide a more granular understanding of the underlying structure, potentially revealing subtleties that might not be as evident with fewer clusters. As we delve into k=7, our objective is to refine our clustering configuration to align more closely with the intricate details present in the dataset.

```{r}
# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 7 clusters
kmeans.result <- kmeans(numeric_data, 7)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)




```

#### The Silhouette coefficient

```{r}
library(cluster)

# compute the silhouette width of each observation
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))

# plot the silhouette widths, with the average width per cluster
fviz_silhouette(avg_sil)

```

#### The total within-cluster sum of squares

```{r}
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")

true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  # illustrative labels; a full evaluation would use one label per row of numeric_data

cluster_assignments <- kmeans.result$cluster
 
```

#### BCubed recall and precision

```{r}
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")


```

### **Cluster Analysis for** k=7:

-   **Silhouette Width:** 0.27

    -   Indicates clusters are less well-separated compared to k=2 and k=5

-   **Within-Cluster Sum of Squares:** 3389.656

    -   Lower within-cluster sum suggests more compact clusters than k=5

-   **BCubed Precision:** 0.00067

    -   Very low precision indicating significant misclassifications.

-   **BCubed Recall:** 0.067

    -   Low recall suggesting incomplete representation of true clusters.

**Final Analysis:** Clusters are less well-separated than at k=2 and k=5; while more compact than at k=5, precision and recall remain low, indicating potential misclassifications.

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

### **Elbow Method Insight:**

We implemented the Elbow Method for determining the optimal number of clusters (k) in a k-means clustering algorithm.

```{r}
# Function to calculate total within-cluster sum of squares (wss)
wss <- function(k) {
  kmeans_result <- kmeans(numeric_data, centers = k, nstart = 10)  # You can adjust nstart based on your preference
  return(sum(kmeans_result$tot.withinss))
}

# Calculate the total within-cluster sum of squares for different values of k
k_values <- 1:10  # You can adjust the range of k values
wss_values <- sapply(k_values, wss)

# Plot the elbow curve
plot(k_values, wss_values, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of Clusters (k)", ylab = "Total Within-Cluster Sum of Squares (WSS)",
     main = "Elbow Method")


```

The elbow method has indicated that K=5 is a reasonable choice in terms of balancing the trade-off between capturing variance and not overly complicating the model with too many clusters.

## Findings

### Classification:

|             |           | 50%-50%    |            |           | 70%-30%    |            |           | 80%-20%    |            |
|-------------|-----------|------------|------------|-----------|------------|------------|-----------|------------|------------|
|             | info gain | gain ratio | gini index | info gain | gain ratio | gini index | info gain | gain ratio | gini index |
| Accuracy    | 84.22%    | 85.21%     | 85.21%     | 82.37%    | 80.34%     | 79.66%     | 83.84%    | 80.03%     | 80.81%     |
| Error rate  | 15.78%    | 14.79%     | 14.79%     | 17.63%    | 19.66%     | 20.34%     | 16.16%    | 19.97%     | 19.19%     |
| Sensitivity | 89.53%    | 87.60%     | 88.37%     | 75.89%    | 77.30%     | 78.01%     | 86.73%    | 84.69%     | 76.53%     |
| Specificity | 78.71%    | 82.73%     | 81.93%     | 88.31%    | 83.12%     | 81.17%     | 81.00%    | 76.00%     | 85.00%     |
| Precision   | 81.34%    | 84.01%     | 83.52%     | 85.60%    | 80.74%     | 79.14%     | 81.73%    | 77.57%     | 83.33%     |

Accuracy (recognition rate): Percentage of test set tuples that are correctly classified

Error rate (misclassification rate): 1 − accuracy

Sensitivity (recall): True positive recognition rate

Specificity: True negative recognition rate

Precision (exactness): What % of tuples labeled as positive are actually positive.
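In terms of the confusion-matrix counts (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives), the metrics above are:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Error rate} = 1 - \text{Accuracy}
$$

$$
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP}
$$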

For the 50%-50% partition: gain ratio and Gini index are equal in accuracy, 85.21%, but as mentioned in [best model], we choose based on precision (84.01%) and specificity (82.73%), so gain ratio is the best model.

For the 70%-30% partition: information gain is the best model, with an accuracy of 82.37%.

For the 80%-20% partition: information gain is the best model, with an accuracy of 83.84%.

Overall, 50%-50% was the best partition, with the gain ratio algorithm achieving an accuracy of 85.21%.

### Clustering

+-------------------------------------+-------------+-------------+------------+
| Metric                              | K=2         | K=5         | K=7        |
+=====================================+=============+=============+============+
| Average Silhouette Width            | 0.36        | 0.29        | 0.27       |
+-------------------------------------+-------------+-------------+------------+
| Total Within-Cluster Sum of Squares | 8669.059    | 4303.841    | 3389.656   |
+-------------------------------------+-------------+-------------+------------+
| BCubed Precision                    | 0.002333333 | 0.001666667 | 0.00067    |
+-------------------------------------+-------------+-------------+------------+
| BCubed Recall                       | 0.1166667   | 0.1333333   | 0.067      |
+-------------------------------------+-------------+-------------+------------+

In our overall analysis, we evaluated the performance of the k-means clustering algorithm for three different values of k (2, 5, and 7). The key metrics provide insights into the characteristics of the resulting clusters:

1.  *Average Silhouette Width:*

    -   The average silhouette width measures the separation and cohesion of clusters. As k increases from 2 to 7, there is a gradual decline in silhouette width, indicating that clusters become less distinct. However, k=5 still maintains a reasonable silhouette width, suggesting a balance between separation and cohesion.

2.  *Total Within-Cluster Sum of Squares:*

    -   The total within-cluster sum of squares reflects the compactness of clusters. Notably, k=7 exhibits the lowest sum of squares, implying more compact clusters compared to k=5 and k=2. This aligns with the expectation that a higher k tends to yield more compact clusters.

3.  *BCubed Precision and Recall:*

    -   BCubed precision and recall metrics assess the accuracy and completeness of clustering. Precision and recall values are relatively low across all k values, indicating potential misclassifications. However, k=5 stands out with slightly higher precision and recall compared to k=2 and k=7, suggesting a better balance between accuracy and completeness.

Considering the results obtained from the elbow method, which identifies k=5 as the optimal number of clusters, our analysis aligns with this recommendation. The detailed examination of silhouette width, within-cluster sum of squares, and BCubed metrics reinforces the choice of k=5 as a suitable configuration, striking a balance between cluster distinctiveness and internal cohesion. This comprehensive evaluation supports the decision to proceed with k=5 for a more refined and effective clustering solution.

The following figures show the cluster visualizations for each trial:

K=2:

![](images/WhatsApp Image 2023-12-02 at 21.37.33_d6e6b58a.jpg)

k=5

![](images/WhatsApp Image 2023-12-02 at 21.39.14_75db9341.jpg)

K=7

![](images/WhatsApp Image 2023-12-02 at 21.39.40_d563d64c.jpg)


### Classification vs Clustering

Classification offers a more interpretable and actionable solution than clustering, as it assigns clear labels to data points. In addition, the availability of class labels enhances the model's ability to generalize and make informed predictions. While clustering struggles with overlapping clusters and the lack of clear evaluation metrics, classification leverages the labeled data to optimize performance and provide meaningful insights. Therefore, for our dataset, the use of classification is justified not only by its high accuracy but also by its ability to address the specific characteristics and challenges of the data.
